2025-05-07T20:23:26.0255917Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0261444Z Runner name: 'i-00cc0d8f8d78d1eb8'
2025-05-07T20:23:26.0262408Z Machine name: 'ip-10-0-58-159'
2025-05-07T20:23:26.0265113Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0267448Z Contents: read
2025-05-07T20:23:26.0267965Z Metadata: read
2025-05-07T20:23:26.0268453Z Packages: read
2025-05-07T20:23:26.0268942Z ##[endgroup]
2025-05-07T20:23:26.0270857Z Secret source: None
2025-05-07T20:23:26.0271487Z Prepare workflow directory
2025-05-07T20:23:26.1189832Z Prepare all required actions
2025-05-07T20:23:26.1238789Z Getting action download info
2025-05-07T20:23:26.3490857Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6180783Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:26.9706940Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.5370050Z Getting action download info
2025-05-07T20:23:28.6506214Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9140602Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.6.3, 12.6.3, clang)
2025-05-07T20:23:28.9637674Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:28.9742393Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:28.9753737Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:28.9754382Z ##[endgroup]
2025-05-07T20:23:30.2491339Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.2491801Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.2492037Z AMI Name: unknown
2025-05-07T20:23:30.2535295Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.8122184Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.8122511Z with:
2025-05-07T20:23:35.8122732Z   submodules: true
2025-05-07T20:23:35.8122978Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.8123360Z   token: ***
2025-05-07T20:23:35.8123568Z   ssh-strict: true
2025-05-07T20:23:35.8123774Z   ssh-user: git
2025-05-07T20:23:35.8124001Z   persist-credentials: true
2025-05-07T20:23:35.8124245Z   clean: true
2025-05-07T20:23:35.8124578Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.8124841Z   fetch-depth: 1
2025-05-07T20:23:35.8125056Z   fetch-tags: false
2025-05-07T20:23:35.8125285Z   show-progress: true
2025-05-07T20:23:35.8125501Z   lfs: false
2025-05-07T20:23:35.8125718Z   set-safe-directory: true
2025-05-07T20:23:35.8125981Z env:
2025-05-07T20:23:35.8126195Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.8126492Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.8126762Z   BUILD_TARGET: genai
2025-05-07T20:23:35.8126983Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.8127274Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:35.8127533Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.8127791Z ##[endgroup]
2025-05-07T20:23:35.9294302Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.9295444Z ##[group]Getting Git version info
2025-05-07T20:23:35.9295877Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.9296550Z [command]/usr/bin/git version
2025-05-07T20:23:35.9297023Z git version 2.47.1
2025-05-07T20:23:35.9307168Z ##[endgroup]
2025-05-07T20:23:35.9320848Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cc65702f-e083-453e-a7b7-2486d1798cdb' before making global git config changes
2025-05-07T20:23:35.9321939Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.9334549Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.9375874Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.9399469Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.9419889Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.9425679Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.9451744Z refs/heads/main
2025-05-07T20:23:35.9460830Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.8147512Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.8203343Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.8236024Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.8243027Z ##[endgroup]
2025-05-07T20:23:36.8246741Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.8673778Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.8763184Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.8850369Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.8935140Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.9021403Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.9107849Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.9191480Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.9207995Z ##[group]Cleaning the repository
2025-05-07T20:23:36.9213426Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.9273303Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.9386167Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9393375Z ##[endgroup]
2025-05-07T20:23:36.9395550Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.9399586Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.9434648Z ##[endgroup]
2025-05-07T20:23:36.9435588Z ##[group]Setting up auth
2025-05-07T20:23:36.9452461Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.9483716Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.9816601Z Entering 'external/asmjit'
2025-05-07T20:23:36.9883201Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.9959111Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.0028692Z Entering 'external/cutlass'
2025-05-07T20:23:37.0102800Z Entering 'external/googletest'
2025-05-07T20:23:37.0170548Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.0238889Z Entering 'external/json'
2025-05-07T20:23:37.0322206Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:37.0358125Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:37.0686443Z Entering 'external/asmjit'
2025-05-07T20:23:37.0754634Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.0828683Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.0895910Z Entering 'external/cutlass'
2025-05-07T20:23:37.0972454Z Entering 'external/googletest'
2025-05-07T20:23:37.1038788Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1111063Z Entering 'external/json'
2025-05-07T20:23:37.1199326Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.1251852Z ##[endgroup]
2025-05-07T20:23:37.1252447Z ##[group]Fetching the repository
2025-05-07T20:23:37.1259404Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.3716999Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.3717679Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.3743803Z ##[endgroup]
2025-05-07T20:23:37.3744293Z ##[group]Determining the checkout info
2025-05-07T20:23:37.3745696Z ##[endgroup]
2025-05-07T20:23:37.3750347Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.3803202Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.3831986Z ##[group]Checking out the ref
2025-05-07T20:23:37.3835902Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.3957605Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.3961185Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.3971692Z ##[endgroup]
2025-05-07T20:23:37.3972261Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.3977586Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.4030015Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.4061116Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.4092512Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.4121590Z ##[endgroup]
2025-05-07T20:23:37.4122254Z ##[group]Fetching submodules
2025-05-07T20:23:37.4125394Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.4507515Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.4508172Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.4508935Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.4509419Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.4510168Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.4510711Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.4511187Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.4524862Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.4964876Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.5119334Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.5222088Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.5390840Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.5484521Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.5569069Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.5672974Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.5691006Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.6037984Z Entering 'external/asmjit'
2025-05-07T20:23:37.6069771Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6102307Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6134217Z Entering 'external/cutlass'
2025-05-07T20:23:37.6165373Z Entering 'external/googletest'
2025-05-07T20:23:37.6196276Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.6228399Z Entering 'external/json'
2025-05-07T20:23:37.6272277Z ##[endgroup]
2025-05-07T20:23:37.6272813Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.6278510Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.6615545Z Entering 'external/asmjit'
2025-05-07T20:23:37.6659079Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6659917Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6703583Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6749444Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6750007Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6798886Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6845261Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6845668Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6888637Z Entering 'external/cutlass'
2025-05-07T20:23:37.6932311Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6932743Z url.https://github.com/.insteadof
2025-05-07T20:23:37.6984011Z Entering 'external/googletest'
2025-05-07T20:23:37.7027038Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7027708Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7073718Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7117999Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7118317Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7161810Z Entering 'external/json'
2025-05-07T20:23:37.7203961Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7204379Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7267181Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.7598177Z Entering 'external/asmjit'
2025-05-07T20:23:37.7660682Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.7663483Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.7725994Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.7728645Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.7792464Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.7795138Z Entering 'external/cutlass'
2025-05-07T20:23:37.7856038Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.7858888Z Entering 'external/googletest'
2025-05-07T20:23:37.7919637Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.7922741Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7985591Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.7988642Z Entering 'external/json'
2025-05-07T20:23:37.8049643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.8173407Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.8508904Z Entering 'external/asmjit'
2025-05-07T20:23:37.8542990Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.8577048Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.8609622Z Entering 'external/cutlass'
2025-05-07T20:23:37.8641788Z Entering 'external/googletest'
2025-05-07T20:23:37.8676648Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.8709629Z Entering 'external/json'
2025-05-07T20:23:37.8758618Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.9103719Z Entering 'external/asmjit'
2025-05-07T20:23:37.9136502Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9168465Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9200237Z Entering 'external/cutlass'
2025-05-07T20:23:37.9233024Z Entering 'external/googletest'
2025-05-07T20:23:37.9266384Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.9298442Z Entering 'external/json'
2025-05-07T20:23:37.9342656Z ##[endgroup]
2025-05-07T20:23:37.9384523Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.9411543Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.9593226Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.9593544Z with:
2025-05-07T20:23:37.9593784Z   name: fbgemm_genai_x86_clang_py3.13_cu12.6.3.whl
2025-05-07T20:23:37.9594108Z   merge-multiple: false
2025-05-07T20:23:37.9594355Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.9594610Z   run-id: 14891846252
2025-05-07T20:23:37.9594816Z env:
2025-05-07T20:23:37.9595030Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.9595318Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.9595560Z   BUILD_TARGET: genai
2025-05-07T20:23:37.9595779Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.9596009Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:37.9596252Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.9596484Z ##[endgroup]
2025-05-07T20:23:38.2005940Z Downloading single artifact
2025-05-07T20:23:38.3006096Z Preparing to download the following artifacts:
2025-05-07T20:23:38.3006944Z - fbgemm_genai_x86_clang_py3.13_cu12.6.3.whl (ID: 3081362277, Size: 12530270, Expected Digest: sha256:6fa4516502c42a89fd649c1939af90f32cc7d86658a396f78f59cfb176666b1d)
2025-05-07T20:23:38.3600149Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-c8828c4a-eec1-58f2-b24b-eb0fdc904bcf/artifacts/8d055d153845bcf029149b916cc2e353d66c98a769054b62a391af6d1d7e4629.zip
2025-05-07T20:23:38.3601554Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.4141623Z (node:58210) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.4142639Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.6108623Z SHA256 digest of downloaded artifact is 6fa4516502c42a89fd649c1939af90f32cc7d86658a396f78f59cfb176666b1d
2025-05-07T20:23:38.6109229Z Artifact download completed successfully.
2025-05-07T20:23:38.6109606Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.6115095Z Download artifact has finished successfully
2025-05-07T20:23:38.6364973Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.6365359Z with:
2025-05-07T20:23:38.6365568Z   driver-version: 570.133.07
2025-05-07T20:23:38.6365814Z env:
2025-05-07T20:23:38.6366031Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.6366320Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.6366557Z   BUILD_TARGET: genai
2025-05-07T20:23:38.6366784Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.6367005Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.6367252Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.6367486Z ##[endgroup]
2025-05-07T20:23:38.6462729Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.6463123Z with:
2025-05-07T20:23:38.6463346Z   timeout_minutes: 10
2025-05-07T20:23:38.6463598Z   max_attempts: 3
2025-05-07T20:23:38.6487029Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not print over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Fail to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing piece of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPUs. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.6510502Z   retry_wait_seconds: 10
2025-05-07T20:23:38.6510767Z   polling_interval_seconds: 1
2025-05-07T20:23:38.6511027Z   warning_on_retry: true
2025-05-07T20:23:38.6511280Z   continue_on_error: false
2025-05-07T20:23:38.6511528Z env:
2025-05-07T20:23:38.6511744Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.6512047Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.6512305Z   BUILD_TARGET: genai
2025-05-07T20:23:38.6512526Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.6512771Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.6513034Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.6531414Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.6531694Z ##[endgroup]
2025-05-07T20:23:38.7344027Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.7344786Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.7347859Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.0500416Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.0500842Z No packages marked for removal.
2025-05-07T20:23:39.0566974Z Dependencies resolved.
2025-05-07T20:23:39.0577644Z Nothing to do.
2025-05-07T20:23:39.0578115Z Complete!
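Editor's note: the retry command above is easiest to follow in isolation. Below is a minimal, hedged sketch of just the idempotency check it performs before deciding whether to (re)install the driver; `DRIVER_VERSION` stands in for the action's `driver-version` input (570.133.07 in this run), and the exit-status-14 allowance mirrors the allowlist the script references.

```bash
#!/usr/bin/env bash
# Sketch only, not the action's actual script: check whether the installed
# NVIDIA driver already matches the requested version before installing.
set -u

DRIVER_VERSION="${DRIVER_VERSION:-570.133.07}"   # assumed input for illustration

if command -v nvidia-smi >/dev/null 2>&1; then
  # Query only GPU 0 so multi-GPU runners report a single version string
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)"
  status=$?
  if [ "$status" -ne 0 ] && [ "$status" -ne 14 ]; then
    echo "Could not query driver version (status ${status}); reinstall needed"
  elif [ "$installed" = "$DRIVER_VERSION" ]; then
    echo "NVIDIA driver ${installed} already installed; skipping installation"
  else
    echo "Found ${installed}, expected ${DRIVER_VERSION}; reinstall needed"
  fi
else
  echo "nvidia-smi not found; driver installation required"
fi
```

This matches the decision visible in the execution log that follows: the runner already has 570.133.07, so installation is skipped.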
2025-05-07T20:23:39.1510676Z + install_nvidia_driver_common
2025-05-07T20:23:39.1513942Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:39.1514261Z + lspci
2025-05-07T20:23:39.1516228Z Before installing NVIDIA driver
2025-05-07T20:23:39.1715225Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.1716412Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.1716955Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.1717456Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.1717996Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.1718693Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.1719248Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.1719712Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.1720097Z + lsmod
2025-05-07T20:23:39.1758609Z Module                  Size  Used by
2025-05-07T20:23:39.1759011Z xt_conntrack           16384  1
2025-05-07T20:23:39.1759440Z nft_chain_nat          16384  3
2025-05-07T20:23:39.1759782Z xt_MASQUERADE          20480  1
2025-05-07T20:23:39.1760170Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.1760485Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:39.1760920Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.1761580Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:39.1762174Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:39.1762488Z xfrm_user              57344  1
2025-05-07T20:23:39.1762744Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:39.1763028Z xt_addrtype            16384  2
2025-05-07T20:23:39.1763283Z nft_compat             20480  4
2025-05-07T20:23:39.1763585Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.1763991Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.1764469Z br_netfilter           36864  0
2025-05-07T20:23:39.1764986Z bridge                323584  1 br_netfilter
2025-05-07T20:23:39.1765283Z stp                    16384  1 bridge
2025-05-07T20:23:39.1765566Z llc                    16384  2 bridge,stp
2025-05-07T20:23:39.1765848Z overlay               167936  0
2025-05-07T20:23:39.1766088Z tls                   135168  0
2025-05-07T20:23:39.1766335Z nls_ascii              16384  1
2025-05-07T20:23:39.1766589Z nls_cp437              20480  1
2025-05-07T20:23:39.1766828Z vfat                   24576  1
2025-05-07T20:23:39.1767077Z fat                    86016  1 vfat
2025-05-07T20:23:39.1767341Z sunrpc                696320  1
2025-05-07T20:23:39.1767586Z ena                   180224  0
2025-05-07T20:23:39.1767819Z i8042                  45056  0
2025-05-07T20:23:39.1768067Z serio                  28672  3 i8042
2025-05-07T20:23:39.1768335Z button                 24576  0
2025-05-07T20:23:39.1768580Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:39.1768849Z sch_fq_codel           20480  17
2025-05-07T20:23:39.1769107Z dm_mod                188416  0
2025-05-07T20:23:39.1769343Z fuse                  163840  1
2025-05-07T20:23:39.1769583Z loop                   36864  0
2025-05-07T20:23:39.1769814Z configfs               57344  1
2025-05-07T20:23:39.1770044Z dax                    45056  1 dm_mod
2025-05-07T20:23:39.1770299Z dmi_sysfs              20480  0
2025-05-07T20:23:39.1770533Z crc32_pclmul           16384  0
2025-05-07T20:23:39.1770766Z crc32c_intel           24576  0
2025-05-07T20:23:39.1771005Z efivarfs               24576  1
2025-05-07T20:23:39.1771244Z + modinfo nvidia
2025-05-07T20:23:39.1777528Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.1778156Z import_ns: DMA_BUF
2025-05-07T20:23:39.1778477Z alias: char-major-195-*
2025-05-07T20:23:39.1778826Z version: 570.133.07
2025-05-07T20:23:39.1779066Z supported: external
2025-05-07T20:23:39.1779504Z license: Dual MIT/GPL
2025-05-07T20:23:39.1779941Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.1780379Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.1780901Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:39.1781211Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.1781534Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.1781851Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.1782135Z depends: i2c-core,drm
2025-05-07T20:23:39.1782375Z retpoline: Y
2025-05-07T20:23:39.1782577Z name: nvidia
2025-05-07T20:23:39.1782985Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.1783615Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.1784163Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.1784565Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.1784855Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:39.1785160Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.1785468Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:39.1785754Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:39.1786100Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:39.1786584Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.1787080Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.1787435Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.1787721Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:39.1788006Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.1788353Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.1788733Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.1789095Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.1789490Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.1790160Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.1790719Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.1791136Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.1791457Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.1791812Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.1792167Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.1792482Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.1792789Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.1793103Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.1793402Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.1793698Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:39.1794030Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.1794365Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.1794683Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:39.1795000Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.1795324Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.1795644Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:39.1795967Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.1796283Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:39.1796546Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.1796853Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.1797159Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.1797452Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.1797764Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.1798105Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.1798428Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:39.1798744Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.1799131Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.1799448Z parm: rm_firmware_active:charp
2025-05-07T20:23:39.1799835Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.1800067Z ++ command -v nvidia-smi
2025-05-07T20:23:39.1800314Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.1800547Z + set +e
2025-05-07T20:23:39.1800839Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:41.0157877Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:41.0158237Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:41.0158519Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:41.0158854Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:41.0159284Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:41.0159963Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:41.0160715Z + set -e
2025-05-07T20:23:41.0160978Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:41.0161346Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:41.0161915Z + post_install_nvidia_driver_common
2025-05-07T20:23:41.0166153Z + sudo modprobe nvidia
2025-05-07T20:23:41.1800724Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:41.1918990Z + lspci
2025-05-07T20:23:41.1919349Z After installing NVIDIA driver
2025-05-07T20:23:41.1919954Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.1920669Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.1921203Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.1922001Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:41.1922661Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:41.1923172Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.1923633Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.1924527Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:41.1924927Z + lsmod
2025-05-07T20:23:41.1953487Z Module                  Size  Used by
2025-05-07T20:23:41.1953971Z nvidia_uvm           1884160  0
2025-05-07T20:23:41.1954425Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:41.1954937Z drm                   602112  1 nvidia
2025-05-07T20:23:41.1955443Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:41.1955956Z backlight              24576  1 drm
2025-05-07T20:23:41.1956420Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:41.1956694Z xt_conntrack           16384  1
2025-05-07T20:23:41.1956956Z nft_chain_nat          16384  3
2025-05-07T20:23:41.1957217Z xt_MASQUERADE          20480  1
2025-05-07T20:23:41.1957518Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:41.1957843Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:41.1958248Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:41.1958700Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:41.1959015Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:41.1959318Z xfrm_user              57344  1
2025-05-07T20:23:41.1959587Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:41.1959868Z xt_addrtype            16384  2
2025-05-07T20:23:41.1960139Z nft_compat             20480  4
2025-05-07T20:23:41.1960457Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:41.1960885Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:41.1961257Z br_netfilter           36864  0
2025-05-07T20:23:41.1961540Z bridge                323584  1 br_netfilter
2025-05-07T20:23:41.1961846Z stp                    16384  1 bridge
2025-05-07T20:23:41.1962138Z llc                    16384  2 bridge,stp
2025-05-07T20:23:41.1962430Z overlay               167936  0
2025-05-07T20:23:41.1962695Z tls                   135168  0
2025-05-07T20:23:41.1962954Z nls_ascii              16384  1
2025-05-07T20:23:41.1963428Z nls_cp437              20480  1
2025-05-07T20:23:41.1963699Z vfat                   24576  1
2025-05-07T20:23:41.1963955Z fat                    86016  1 vfat
2025-05-07T20:23:41.1964232Z sunrpc                696320  1
2025-05-07T20:23:41.1964644Z ena                   180224  0
2025-05-07T20:23:41.1964891Z i8042                  45056  0
2025-05-07T20:23:41.1965171Z serio                  28672  3 i8042
2025-05-07T20:23:41.1965457Z button                 24576  0
2025-05-07T20:23:41.1965722Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:41.1965991Z sch_fq_codel           20480  17
2025-05-07T20:23:41.1966265Z dm_mod                188416  0
2025-05-07T20:23:41.1966530Z fuse                  163840  1
2025-05-07T20:23:41.1966776Z loop                   36864  0
2025-05-07T20:23:41.1967037Z configfs               57344  1
2025-05-07T20:23:41.1967297Z dax                    45056  1 dm_mod
2025-05-07T20:23:41.1967572Z dmi_sysfs              20480  0
2025-05-07T20:23:41.1967829Z crc32_pclmul           16384  0
2025-05-07T20:23:41.1968098Z crc32c_intel           24576  0
2025-05-07T20:23:41.1968347Z efivarfs               24576  1
2025-05-07T20:23:41.1968602Z + modinfo nvidia
2025-05-07T20:23:41.1970753Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:41.1971544Z import_ns: DMA_BUF
2025-05-07T20:23:41.1971958Z alias: char-major-195-*
2025-05-07T20:23:41.1972334Z version: 570.133.07
2025-05-07T20:23:41.1972588Z supported: external
2025-05-07T20:23:41.1972828Z license: Dual MIT/GPL
2025-05-07T20:23:41.1973117Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:41.1973455Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:41.1973776Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:41.1974096Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:41.1974429Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:41.1974889Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:41.1975197Z depends: i2c-core,drm
2025-05-07T20:23:41.1975515Z retpoline: Y
2025-05-07T20:23:41.1975832Z name: nvidia
2025-05-07T20:23:41.1976319Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:41.1976961Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:41.1977430Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:41.1977855Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:41.1978160Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:41.1978468Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:41.1978794Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:41.1979093Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:41.1979410Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:41.1979791Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:41.1980179Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:41.1980520Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:41.1980831Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:41.1981132Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:41.1981504Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:41.1981914Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:41.1982298Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:41.1982702Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.1983115Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:41.1983544Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.1983945Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:41.1984284Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:41.1984655Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:41.1985124Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:41.1985463Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:41.1985781Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:41.1986106Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:41.1986419Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:41.1986725Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:41.1987071Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:41.1987417Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:41.1987737Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:41.1988065Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:41.1988402Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:41.1988734Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:41.1989070Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:41.1989403Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:41.1989685Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:41.1990004Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:41.1990324Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:41.1990629Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:41.1990954Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:41.1991309Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:41.1991719Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:41.1992056Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:41.1992399Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:41.1992747Z parm: rm_firmware_active:charp
2025-05-07T20:23:41.1993031Z + set +e
2025-05-07T20:23:41.1993221Z + nvidia-smi
2025-05-07T20:23:42.6113330Z Wed May 7 20:23:42 2025
2025-05-07T20:23:42.6113857Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.6114774Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.6115253Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.6115747Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.6116269Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.6116699Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.6117025Z |=========================================+========================+======================|
2025-05-07T20:23:42.6178922Z |   0  NVIDIA A10G                   Off  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.6179602Z |  0%   31C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.6180005Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.6180470Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.6180878Z 
2025-05-07T20:23:42.6181274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.6181704Z | Processes:                                                                              |
2025-05-07T20:23:42.6182137Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.6182548Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.6182897Z |=========================================================================================|
2025-05-07T20:23:42.6183602Z |  No running processes found                                                             |
2025-05-07T20:23:42.6184267Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.0392303Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.4534678Z NVIDIA A10G
2025-05-07T20:23:44.7269374Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.7269739Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.7269983Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.7270270Z + set -e
2025-05-07T20:23:44.7270474Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.7280935Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.7284676Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.1867960Z Last metadata expiration check: 0:05:02 ago on Wed May 7 20:18:43 2025.
2025-05-07T20:23:45.2122374Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.2518432Z Dependencies resolved.
2025-05-07T20:23:45.2703774Z Nothing to do.
2025-05-07T20:23:45.2704107Z Complete!
2025-05-07T20:23:45.3120825Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.3121541Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3122446Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7225696Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7798862Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.3876851Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:46.4126399Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.4528247Z Dependencies resolved.
2025-05-07T20:23:46.4710304Z ================================================================================
2025-05-07T20:23:46.4723975Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.4724679Z ================================================================================
2025-05-07T20:23:46.4724988Z Downgrading:
2025-05-07T20:23:46.4725353Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:46.4725922Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:46.4726268Z 
2025-05-07T20:23:46.4726361Z Transaction Summary
2025-05-07T20:23:46.4726614Z ================================================================================
2025-05-07T20:23:46.4726927Z Downgrade  2 Packages
2025-05-07T20:23:46.4727072Z 
2025-05-07T20:23:46.4727171Z Total download size: 6.8 M
2025-05-07T20:23:46.4727428Z Downloading Packages:
2025-05-07T20:23:46.5386697Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  19 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.5688425Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  58 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.5697937Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.5700951Z Total                                            70 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.5703674Z Running transaction check
2025-05-07T20:23:46.5808059Z Transaction check succeeded.
2025-05-07T20:23:46.5808658Z Running transaction test
2025-05-07T20:23:46.6103833Z Transaction test succeeded.
2025-05-07T20:23:46.6106814Z Running transaction
2025-05-07T20:23:47.1598737Z   Preparing        :                                                        1/1
2025-05-07T20:23:47.2659832Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:47.2681660Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2938685Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2939252Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.3039089Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.3062520Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.4818854Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.4819494Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.4820035Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.4820571Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.6068060Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.6068060Z ================================================================================
2025-05-07T20:23:47.6069105Z WARNING:
2025-05-07T20:23:47.6069765Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.6070426Z 
2025-05-07T20:23:47.6070646Z   Available Versions:
2025-05-07T20:23:47.6071019Z 
2025-05-07T20:23:47.6071261Z   Version 2023.7.20250331:
2025-05-07T20:23:47.6071897Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.6072140Z 
2025-05-07T20:23:47.6072265Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.6072475Z 
2025-05-07T20:23:47.6072551Z     Release notes:
2025-05-07T20:23:47.6072948Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.6073308Z 
2025-05-07T20:23:47.6073397Z   Version 2023.7.20250414:
2025-05-07T20:23:47.6073688Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.6073933Z 
2025-05-07T20:23:47.6074041Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.6074240Z 
2025-05-07T20:23:47.6074323Z     Release notes:
2025-05-07T20:23:47.6074701Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.6075365Z 
2025-05-07T20:23:47.6075446Z   Version 2023.7.20250428:
2025-05-07T20:23:47.6075745Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.6075983Z 
2025-05-07T20:23:47.6076098Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.6076297Z 
2025-05-07T20:23:47.6076372Z     Release notes:
2025-05-07T20:23:47.6076747Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.6077103Z 
2025-05-07T20:23:47.6077213Z ================================================================================
2025-05-07T20:23:47.6434660Z 
2025-05-07T20:23:47.6435177Z 
2025-05-07T20:23:47.6435500Z Downgraded:
2025-05-07T20:23:47.6436224Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.6437337Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.6438004Z 
2025-05-07T20:23:47.6438156Z Complete!
2025-05-07T20:23:47.6926648Z + sudo systemctl restart docker
2025-05-07T20:23:51.9801278Z Wed May 7 20:23:51 2025
2025-05-07T20:23:51.9801706Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9802191Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:51.9802667Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.9803156Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:51.9803662Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:51.9804084Z |                                         |                        |               MIG M. |
2025-05-07T20:23:51.9804563Z |=========================================+========================+======================|
2025-05-07T20:23:51.9886678Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:51.9887972Z |  0%   30C    P0             60W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:51.9888509Z |                                         |                        |                  N/A |
2025-05-07T20:23:51.9888896Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.9889283Z 
2025-05-07T20:23:51.9889652Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9890064Z | Processes:                                                                              |
2025-05-07T20:23:51.9890485Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:51.9890880Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:51.9891213Z |=========================================================================================|
2025-05-07T20:23:51.9891928Z |  No running processes found                                                             |
2025-05-07T20:23:51.9892381Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.7126456Z Command completed after 1 attempt(s).
2025-05-07T20:23:52.7217374Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.7217840Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.7233588Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:52.7233936Z env:
2025-05-07T20:23:52.7234163Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:52.7234460Z   BUILD_ENV: build_binary
2025-05-07T20:23:52.7234708Z   BUILD_TARGET: genai
2025-05-07T20:23:52.7234939Z   BUILD_VARIANT: cuda
2025-05-07T20:23:52.7235172Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:52.7235646Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:52.7235952Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:52.7236276Z ##[endgroup]
2025-05-07T20:23:53.0662684Z ################################################################################
2025-05-07T20:23:53.0663041Z # Print System Info
2025-05-07T20:23:53.0663253Z #
2025-05-07T20:23:53.0680940Z # [2025-05-07T20:23:53.067Z] + print_system_info
2025-05-07T20:23:53.0681304Z ################################################################################
2025-05-07T20:23:53.0681525Z 
2025-05-07T20:23:53.0681637Z ################################################################################
2025-05-07T20:23:53.0681981Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.0682281Z + printenv 2025-05-07T20:23:53.0682398Z 2025-05-07T20:23:53.0701213Z SHELL=/bin/bash 2025-05-07T20:23:53.0701640Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:53.0702047Z BUILD_VARIANT=cuda 2025-05-07T20:23:53.0702606Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0703190Z GITHUB_ACTION=__run 2025-05-07T20:23:53.0703495Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.0703849Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:53.0704107Z RUNNER_NAME=i-00cc0d8f8d78d1eb8 2025-05-07T20:23:53.0704422Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:53.0704737Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:53.0704999Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:53.0705370Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:53.0705803Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:53.0706078Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:53.0706375Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:53.0707089Z *** 2025-05-07T20:23:53.0707284Z LOGNAME=ec2-user 2025-05-07T20:23:53.0707531Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:53.0707807Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:53.0708060Z GITHUB_ACTIONS=true 2025-05-07T20:23:53.0708511Z SYSTEMD_EXEC_PID=55434 2025-05-07T20:23:53.0708894Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:53.0709437Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:53.0709947Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:53.0710238Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:53.0710508Z RUNNER_OS=Linux 2025-05-07T20:23:53.0710733Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:53.0710987Z HOME=/home/ec2-user 2025-05-07T20:23:53.0711243Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:53.0711523Z LANG=C.UTF-8 2025-05-07T20:23:53.0711815Z RUNNER_TRACKING_ID=github_1e8cb0cf-68f0-4b91-8769-c71669f2594f 2025-05-07T20:23:53.0712172Z RUNNER_ARCH=X64 2025-05-07T20:23:53.0712440Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:53.0712799Z BUILD_TARGET=genai 2025-05-07T20:23:53.0713328Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0714250Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0714988Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:53.0715819Z INVOCATION_ID=384b034384d8415eb8e54073b34c72ff 2025-05-07T20:23:53.0716141Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:53.0716406Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:53.0716980Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0717599Z BUILD_ENV=build_binary 2025-05-07T20:23:53.0717834Z GITHUB_ACTOR=q10 2025-05-07T20:23:53.0718047Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:53.0718278Z KERN_NAME_LC=linux 2025-05-07T20:23:53.0718500Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:53.0718791Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:53.0719324Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:53.0719561Z USER=ec2-user 2025-05-07T20:23:53.0719781Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:53.0720047Z SHLVL=1 2025-05-07T20:23:53.0720234Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.0720539Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.0720972Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.0721325Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.0721564Z KERN_NAME=Linux 2025-05-07T20:23:53.0721793Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.0722201Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.0722632Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.0722905Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.0723154Z JOURNAL_STREAM=8:84893 2025-05-07T20:23:53.0723465Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.0723823Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.0724154Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.0724633Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.0724864Z CI=true 2025-05-07T20:23:53.0725070Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.0725364Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.0725650Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.0725901Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.0726528Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_030f2d6f-c22b-4ae0-b10b-d128e6220f31 2025-05-07T20:23:53.0727129Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.0727354Z _=/usr/bin/printenv 2025-05-07T20:23:53.0727502Z 2025-05-07T20:23:53.0727622Z ################################################################################ 2025-05-07T20:23:53.0727963Z [INFO] Print ldd version ... 2025-05-07T20:23:53.0728227Z + ldd --version 2025-05-07T20:23:53.0728369Z 2025-05-07T20:23:53.0728471Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.0728763Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.0729216Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.0729749Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.0730209Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.0730428Z 2025-05-07T20:23:53.0730568Z ################################################################################ 2025-05-07T20:23:53.0730895Z [INFO] Print CPU info ... 
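The CPU report that follows states the topology three ways (nproc, lscpu, and a per-processor /proc/cpuinfo dump), and the figures are mutually consistent: 1 socket x 8 cores per socket x 2 threads per core = 16 logical CPUs, which is exactly what nproc prints. A quick bash cross-check of that arithmetic against lscpu's standard field names (a sketch for illustration, not part of the CI script):

# Recompute the logical CPU count from the lscpu topology fields.
sockets=$(lscpu | awk -F: '/^Socket\(s\)/ {gsub(/ /, "", $2); print $2}')
cores=$(lscpu | awk -F: '/^Core\(s\) per socket/ {gsub(/ /, "", $2); print $2}')
threads=$(lscpu | awk -F: '/^Thread\(s\) per core/ {gsub(/ /, "", $2); print $2}')
echo $(( sockets * cores * threads ))   # expected to match `nproc` (16 on this runner)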
2025-05-07T20:23:53.0731141Z + nproc 2025-05-07T20:23:53.0731261Z 2025-05-07T20:23:53.0743838Z 16 2025-05-07T20:23:53.0745385Z 2025-05-07T20:23:53.0745601Z + lscpu 2025-05-07T20:23:53.0745712Z 2025-05-07T20:23:53.0864425Z Architecture: x86_64 2025-05-07T20:23:53.0865362Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.0866302Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0866863Z Byte Order: Little Endian 2025-05-07T20:23:53.0867193Z CPU(s): 16 2025-05-07T20:23:53.0867488Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.0867807Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.0868138Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.0868453Z CPU family: 23 2025-05-07T20:23:53.0868945Z Model: 49 2025-05-07T20:23:53.0869226Z Thread(s) per core: 2 2025-05-07T20:23:53.0869511Z Core(s) per socket: 8 2025-05-07T20:23:53.0869784Z Socket(s): 1 2025-05-07T20:23:53.0870048Z Stepping: 0 2025-05-07T20:23:53.0870341Z BogoMIPS: 5600.00 2025-05-07T20:23:53.0872454Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0875229Z Hypervisor vendor: KVM 2025-05-07T20:23:53.0875581Z Virtualization type: full 2025-05-07T20:23:53.0875964Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.0876372Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.0876782Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.0877180Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.0877541Z NUMA node(s): 1 2025-05-07T20:23:53.0877866Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.0878249Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.0878657Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.0879058Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.0879450Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.0879870Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.0880281Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.0880743Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.0881319Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.0881872Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.0882396Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.0883062Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.0883904Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.0884724Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.0885074Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.0885392Z 2025-05-07T20:23:53.0885476Z + cat /proc/cpuinfo 2025-05-07T20:23:53.0885603Z 2025-05-07T20:23:53.0885687Z processor : 0 2025-05-07T20:23:53.0885886Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0886114Z cpu family : 23 2025-05-07T20:23:53.0886310Z model : 49 
2025-05-07T20:23:53.0886498Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0886730Z stepping : 0 2025-05-07T20:23:53.0886925Z microcode : 0x830107f 2025-05-07T20:23:53.0887133Z cpu MHz : 3309.246 2025-05-07T20:23:53.0887338Z cache size : 512 KB 2025-05-07T20:23:53.0887541Z physical id : 0 2025-05-07T20:23:53.0887736Z siblings : 16 2025-05-07T20:23:53.0887924Z core id : 0 2025-05-07T20:23:53.0888107Z cpu cores : 8 2025-05-07T20:23:53.0888292Z apicid : 0 2025-05-07T20:23:53.0888482Z initial apicid : 0 2025-05-07T20:23:53.0888681Z fpu : yes 2025-05-07T20:23:53.0888862Z fpu_exception : yes 2025-05-07T20:23:53.0889066Z cpuid level : 13 2025-05-07T20:23:53.0889262Z wp : yes 2025-05-07T20:23:53.0891293Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0896040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0896510Z bogomips : 5600.00 2025-05-07T20:23:53.0896725Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0896953Z clflush size : 64 2025-05-07T20:23:53.0897151Z cache_alignment : 64 2025-05-07T20:23:53.0897417Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0897732Z power management: 2025-05-07T20:23:53.0897856Z 2025-05-07T20:23:53.0897931Z processor : 1 2025-05-07T20:23:53.0898137Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0898364Z cpu family : 23 2025-05-07T20:23:53.0898551Z model : 49 2025-05-07T20:23:53.0898744Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0898981Z stepping : 0 2025-05-07T20:23:53.0899175Z microcode : 0x830107f 2025-05-07T20:23:53.0899392Z cpu MHz : 3297.931 2025-05-07T20:23:53.0899597Z cache size : 512 KB 2025-05-07T20:23:53.0899793Z physical id : 0 2025-05-07T20:23:53.0899990Z siblings : 16 2025-05-07T20:23:53.0900179Z core id : 1 2025-05-07T20:23:53.0900370Z cpu cores : 8 2025-05-07T20:23:53.0900559Z apicid : 2 2025-05-07T20:23:53.0900746Z initial apicid : 2 2025-05-07T20:23:53.0900939Z fpu : yes 2025-05-07T20:23:53.0901126Z fpu_exception : yes 2025-05-07T20:23:53.0901328Z cpuid level : 13 2025-05-07T20:23:53.0901514Z wp : yes 2025-05-07T20:23:53.0903432Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0905613Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0906089Z bogomips : 5600.00 2025-05-07T20:23:53.0906296Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0906515Z clflush size : 64 
2025-05-07T20:23:53.0906720Z cache_alignment : 64 2025-05-07T20:23:53.0906978Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0907272Z power management: 2025-05-07T20:23:53.0907400Z 2025-05-07T20:23:53.0907478Z processor : 2 2025-05-07T20:23:53.0907678Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0907896Z cpu family : 23 2025-05-07T20:23:53.0908091Z model : 49 2025-05-07T20:23:53.0909041Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0909311Z stepping : 0 2025-05-07T20:23:53.0909525Z microcode : 0x830107f 2025-05-07T20:23:53.0909765Z cpu MHz : 3301.374 2025-05-07T20:23:53.0909991Z cache size : 512 KB 2025-05-07T20:23:53.0910210Z physical id : 0 2025-05-07T20:23:53.0910426Z siblings : 16 2025-05-07T20:23:53.0910652Z core id : 2 2025-05-07T20:23:53.0910856Z cpu cores : 8 2025-05-07T20:23:53.0911064Z apicid : 4 2025-05-07T20:23:53.0911278Z initial apicid : 4 2025-05-07T20:23:53.0911496Z fpu : yes 2025-05-07T20:23:53.0965670Z fpu_exception : yes 2025-05-07T20:23:53.0965977Z cpuid level : 13 2025-05-07T20:23:53.0966249Z wp : yes 2025-05-07T20:23:53.0968685Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0970902Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0971374Z bogomips : 5600.00 2025-05-07T20:23:53.0971734Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0971966Z clflush size : 64 2025-05-07T20:23:53.0972189Z cache_alignment : 64 2025-05-07T20:23:53.0972448Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0972761Z power management: 2025-05-07T20:23:53.0972890Z 2025-05-07T20:23:53.0972978Z processor : 3 2025-05-07T20:23:53.0973184Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0973426Z cpu family : 23 2025-05-07T20:23:53.0973625Z model : 49 2025-05-07T20:23:53.0973815Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0974055Z stepping : 0 2025-05-07T20:23:53.0974254Z microcode : 0x830107f 2025-05-07T20:23:53.0974474Z cpu MHz : 3287.495 2025-05-07T20:23:53.0974688Z cache size : 512 KB 2025-05-07T20:23:53.0974901Z physical id : 0 2025-05-07T20:23:53.0975095Z siblings : 16 2025-05-07T20:23:53.0975287Z core id : 3 2025-05-07T20:23:53.0975482Z cpu cores : 8 2025-05-07T20:23:53.0975665Z apicid : 6 2025-05-07T20:23:53.0975856Z initial apicid : 6 2025-05-07T20:23:53.0976071Z fpu : yes 2025-05-07T20:23:53.0976259Z fpu_exception : yes 2025-05-07T20:23:53.0976470Z cpuid level : 13 2025-05-07T20:23:53.0976674Z wp : yes 2025-05-07T20:23:53.0978595Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0980779Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0981247Z bogomips : 5600.00 2025-05-07T20:23:53.0981467Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0981704Z clflush size : 64 2025-05-07T20:23:53.0981910Z cache_alignment : 64 2025-05-07T20:23:53.0982176Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0982481Z power management: 2025-05-07T20:23:53.0982606Z 2025-05-07T20:23:53.0982685Z processor : 4 2025-05-07T20:23:53.0982896Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0983133Z cpu family : 23 2025-05-07T20:23:53.0983331Z model : 49 2025-05-07T20:23:53.0983542Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0983786Z stepping : 0 2025-05-07T20:23:53.0983983Z microcode : 0x830107f 2025-05-07T20:23:53.0984209Z cpu MHz : 3298.767 2025-05-07T20:23:53.0984420Z cache size : 512 KB 2025-05-07T20:23:53.0984624Z physical id : 0 2025-05-07T20:23:53.0984826Z siblings : 16 2025-05-07T20:23:53.0985026Z core id : 4 2025-05-07T20:23:53.0985220Z cpu cores : 8 2025-05-07T20:23:53.0985406Z apicid : 8 2025-05-07T20:23:53.0985603Z initial apicid : 8 2025-05-07T20:23:53.0985812Z fpu : yes 2025-05-07T20:23:53.0986076Z fpu_exception : yes 2025-05-07T20:23:53.0986304Z cpuid level : 13 2025-05-07T20:23:53.0986508Z wp : yes 2025-05-07T20:23:53.0988536Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.0990729Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.0991201Z bogomips : 5600.00 2025-05-07T20:23:53.0991416Z TLB size : 3072 4K pages 2025-05-07T20:23:53.0991637Z clflush size : 64 2025-05-07T20:23:53.0991841Z cache_alignment : 64 2025-05-07T20:23:53.0992179Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.0992481Z power management: 2025-05-07T20:23:53.0992614Z 2025-05-07T20:23:53.0992691Z processor : 5 2025-05-07T20:23:53.0992897Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.0993127Z cpu family : 23 2025-05-07T20:23:53.0993318Z model : 49 2025-05-07T20:23:53.0993521Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.0993761Z stepping : 0 2025-05-07T20:23:53.0993960Z microcode : 0x830107f 2025-05-07T20:23:53.0994180Z cpu MHz : 3261.757 2025-05-07T20:23:53.0994384Z cache size : 512 KB 2025-05-07T20:23:53.0994589Z physical id : 0 2025-05-07T20:23:53.0994792Z siblings : 16 2025-05-07T20:23:53.0994986Z core id : 5 2025-05-07T20:23:53.0995169Z cpu cores : 8 2025-05-07T20:23:53.0995363Z apicid : 10 2025-05-07T20:23:53.0995558Z initial apicid : 10 2025-05-07T20:23:53.0995758Z fpu : yes 2025-05-07T20:23:53.0995950Z fpu_exception : yes 2025-05-07T20:23:53.0996161Z cpuid level : 13 2025-05-07T20:23:53.0996352Z wp : yes 2025-05-07T20:23:53.0998272Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1000451Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1000928Z bogomips : 5600.00 2025-05-07T20:23:53.1001141Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1001364Z clflush size : 64 2025-05-07T20:23:53.1001575Z cache_alignment : 64 2025-05-07T20:23:53.1001835Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1002146Z power management: 2025-05-07T20:23:53.1002278Z 2025-05-07T20:23:53.1002360Z processor : 6 2025-05-07T20:23:53.1002571Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1002797Z cpu family : 23 2025-05-07T20:23:53.1002997Z model : 49 2025-05-07T20:23:53.1003194Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1003418Z stepping : 0 2025-05-07T20:23:53.1003624Z microcode : 0x830107f 2025-05-07T20:23:53.1003873Z cpu MHz : 3292.007 2025-05-07T20:23:53.1004099Z cache size : 512 KB 2025-05-07T20:23:53.1004443Z physical id : 0 2025-05-07T20:23:53.1004657Z siblings : 16 2025-05-07T20:23:53.1004838Z core id : 6 2025-05-07T20:23:53.1005033Z cpu cores : 8 2025-05-07T20:23:53.1005231Z apicid : 12 2025-05-07T20:23:53.1005431Z initial apicid : 12 2025-05-07T20:23:53.1005634Z fpu : yes 2025-05-07T20:23:53.1005817Z fpu_exception : yes 2025-05-07T20:23:53.1006019Z cpuid level : 13 2025-05-07T20:23:53.1006213Z wp : yes 2025-05-07T20:23:53.1008467Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1010857Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1011323Z bogomips : 5600.00 2025-05-07T20:23:53.1011530Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1011760Z clflush size : 64 2025-05-07T20:23:53.1011966Z cache_alignment : 64 2025-05-07T20:23:53.1012224Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1012531Z power management: 2025-05-07T20:23:53.1012793Z 2025-05-07T20:23:53.1012880Z processor : 7 2025-05-07T20:23:53.1013083Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1013321Z cpu family : 23 2025-05-07T20:23:53.1013523Z model : 49 2025-05-07T20:23:53.1013716Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1013953Z stepping : 0 2025-05-07T20:23:53.1014156Z microcode : 0x830107f 2025-05-07T20:23:53.1014367Z cpu MHz : 3266.554 2025-05-07T20:23:53.1014578Z cache size : 512 KB 2025-05-07T20:23:53.1014780Z physical id : 0 2025-05-07T20:23:53.1014982Z siblings : 16 2025-05-07T20:23:53.1015164Z core id : 7 2025-05-07T20:23:53.1015356Z cpu cores : 8 2025-05-07T20:23:53.1015540Z apicid : 
14 2025-05-07T20:23:53.1015733Z initial apicid : 14 2025-05-07T20:23:53.1015941Z fpu : yes 2025-05-07T20:23:53.1016134Z fpu_exception : yes 2025-05-07T20:23:53.1016334Z cpuid level : 13 2025-05-07T20:23:53.1016535Z wp : yes 2025-05-07T20:23:53.1018456Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1020701Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1021168Z bogomips : 5600.00 2025-05-07T20:23:53.1021384Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1021616Z clflush size : 64 2025-05-07T20:23:53.1021819Z cache_alignment : 64 2025-05-07T20:23:53.1022085Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1022390Z power management: 2025-05-07T20:23:53.1022514Z 2025-05-07T20:23:53.1022596Z processor : 8 2025-05-07T20:23:53.1022797Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1023032Z cpu family : 23 2025-05-07T20:23:53.1023228Z model : 49 2025-05-07T20:23:53.1023415Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1023647Z stepping : 0 2025-05-07T20:23:53.1023849Z microcode : 0x830107f 2025-05-07T20:23:53.1024061Z cpu MHz : 3293.696 2025-05-07T20:23:53.1024264Z cache size : 512 KB 2025-05-07T20:23:53.1024469Z physical id : 0 2025-05-07T20:23:53.1024668Z siblings : 16 2025-05-07T20:23:53.1024859Z core id : 0 2025-05-07T20:23:53.1025046Z cpu cores : 8 2025-05-07T20:23:53.1025225Z apicid : 1 2025-05-07T20:23:53.1025422Z initial apicid : 1 2025-05-07T20:23:53.1025622Z fpu : yes 2025-05-07T20:23:53.1025801Z fpu_exception : yes 2025-05-07T20:23:53.1026008Z cpuid level : 13 2025-05-07T20:23:53.1026209Z wp : yes 2025-05-07T20:23:53.1028118Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1030437Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1030921Z bogomips : 5600.00 2025-05-07T20:23:53.1031135Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1031359Z clflush size : 64 2025-05-07T20:23:53.1031565Z cache_alignment : 64 2025-05-07T20:23:53.1031817Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1032117Z power management: 2025-05-07T20:23:53.1032249Z 2025-05-07T20:23:53.1032330Z processor : 9 2025-05-07T20:23:53.1032531Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1032754Z cpu family : 23 2025-05-07T20:23:53.1033027Z model : 49 2025-05-07T20:23:53.1033222Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1033441Z 
stepping : 0 2025-05-07T20:23:53.1033646Z microcode : 0x830107f 2025-05-07T20:23:53.1033861Z cpu MHz : 3291.027 2025-05-07T20:23:53.1034065Z cache size : 512 KB 2025-05-07T20:23:53.1034266Z physical id : 0 2025-05-07T20:23:53.1034461Z siblings : 16 2025-05-07T20:23:53.1034651Z core id : 1 2025-05-07T20:23:53.1034838Z cpu cores : 8 2025-05-07T20:23:53.1035024Z apicid : 3 2025-05-07T20:23:53.1035204Z initial apicid : 3 2025-05-07T20:23:53.1035409Z fpu : yes 2025-05-07T20:23:53.1035588Z fpu_exception : yes 2025-05-07T20:23:53.1035795Z cpuid level : 13 2025-05-07T20:23:53.1035985Z wp : yes 2025-05-07T20:23:53.1037883Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1040080Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1040556Z bogomips : 5600.00 2025-05-07T20:23:53.1040770Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1040991Z clflush size : 64 2025-05-07T20:23:53.1041204Z cache_alignment : 64 2025-05-07T20:23:53.1041466Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1041768Z power management: 2025-05-07T20:23:53.1041906Z 2025-05-07T20:23:53.1041986Z processor : 10 2025-05-07T20:23:53.1042200Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1042432Z cpu family : 23 2025-05-07T20:23:53.1042624Z model : 49 2025-05-07T20:23:53.1042824Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1043060Z stepping : 0 2025-05-07T20:23:53.1043255Z microcode : 0x830107f 2025-05-07T20:23:53.1043473Z cpu MHz : 3277.478 2025-05-07T20:23:53.1043683Z cache size : 512 KB 2025-05-07T20:23:53.1043884Z physical id : 0 2025-05-07T20:23:53.1044082Z siblings : 16 2025-05-07T20:23:53.1044275Z core id : 2 2025-05-07T20:23:53.1044537Z cpu cores : 8 2025-05-07T20:23:53.1044729Z apicid : 5 2025-05-07T20:23:53.1044923Z initial apicid : 5 2025-05-07T20:23:53.1045120Z fpu : yes 2025-05-07T20:23:53.1045307Z fpu_exception : yes 2025-05-07T20:23:53.1045517Z cpuid level : 13 2025-05-07T20:23:53.1045708Z wp : yes 2025-05-07T20:23:53.1047617Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1049811Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1050285Z bogomips : 5600.00 2025-05-07T20:23:53.1050588Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1050814Z clflush size : 64 2025-05-07T20:23:53.1051023Z cache_alignment : 64 2025-05-07T20:23:53.1051287Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:53.1051586Z power management: 2025-05-07T20:23:53.1051718Z 2025-05-07T20:23:53.1051799Z processor : 11 2025-05-07T20:23:53.1052009Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1052231Z cpu family : 23 2025-05-07T20:23:53.1052430Z model : 49 2025-05-07T20:23:53.1052628Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1052853Z stepping : 0 2025-05-07T20:23:53.1053160Z microcode : 0x830107f 2025-05-07T20:23:53.1053379Z cpu MHz : 3299.388 2025-05-07T20:23:53.1053583Z cache size : 512 KB 2025-05-07T20:23:53.1053797Z physical id : 0 2025-05-07T20:23:53.1054001Z siblings : 16 2025-05-07T20:23:53.1054190Z core id : 3 2025-05-07T20:23:53.1054382Z cpu cores : 8 2025-05-07T20:23:53.1054575Z apicid : 7 2025-05-07T20:23:53.1054762Z initial apicid : 7 2025-05-07T20:23:53.1054975Z fpu : yes 2025-05-07T20:23:53.1055170Z fpu_exception : yes 2025-05-07T20:23:53.1055373Z cpuid level : 13 2025-05-07T20:23:53.1055574Z wp : yes 2025-05-07T20:23:53.1057494Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1059683Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1060159Z bogomips : 5600.00 2025-05-07T20:23:53.1060373Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1060608Z clflush size : 64 2025-05-07T20:23:53.1060818Z cache_alignment : 64 2025-05-07T20:23:53.1061075Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1061387Z power management: 2025-05-07T20:23:53.1061514Z 2025-05-07T20:23:53.1061604Z processor : 12 2025-05-07T20:23:53.1061808Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1062039Z cpu family : 23 2025-05-07T20:23:53.1062236Z model : 49 2025-05-07T20:23:53.1062425Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1062663Z stepping : 0 2025-05-07T20:23:53.1062865Z microcode : 0x830107f 2025-05-07T20:23:53.1063077Z cpu MHz : 3300.423 2025-05-07T20:23:53.1063289Z cache size : 512 KB 2025-05-07T20:23:53.1063497Z physical id : 0 2025-05-07T20:23:53.1063700Z siblings : 16 2025-05-07T20:23:53.1063885Z core id : 4 2025-05-07T20:23:53.1064078Z cpu cores : 8 2025-05-07T20:23:53.1064277Z apicid : 9 2025-05-07T20:23:53.1064463Z initial apicid : 9 2025-05-07T20:23:53.1064669Z fpu : yes 2025-05-07T20:23:53.1064864Z fpu_exception : yes 2025-05-07T20:23:53.1065069Z cpuid level : 13 2025-05-07T20:23:53.1065269Z wp : yes 2025-05-07T20:23:53.1067173Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:53.1069359Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1069824Z bogomips : 5600.00 2025-05-07T20:23:53.1070035Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1070264Z clflush size : 64 2025-05-07T20:23:53.1070471Z cache_alignment : 64 2025-05-07T20:23:53.1070828Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1071138Z power management: 2025-05-07T20:23:53.1071263Z 2025-05-07T20:23:53.1071349Z processor : 13 2025-05-07T20:23:53.1071557Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1071787Z cpu family : 23 2025-05-07T20:23:53.1071985Z model : 49 2025-05-07T20:23:53.1072183Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1072416Z stepping : 0 2025-05-07T20:23:53.1072611Z microcode : 0x830107f 2025-05-07T20:23:53.1072818Z cpu MHz : 3317.395 2025-05-07T20:23:53.1073017Z cache size : 512 KB 2025-05-07T20:23:53.1073295Z physical id : 0 2025-05-07T20:23:53.1073482Z siblings : 16 2025-05-07T20:23:53.1073670Z core id : 5 2025-05-07T20:23:53.1073872Z cpu cores : 8 2025-05-07T20:23:53.1074077Z apicid : 11 2025-05-07T20:23:53.1074268Z initial apicid : 11 2025-05-07T20:23:53.1074464Z fpu : yes 2025-05-07T20:23:53.1074642Z fpu_exception : yes 2025-05-07T20:23:53.1074848Z cpuid level : 13 2025-05-07T20:23:53.1075042Z wp : yes 2025-05-07T20:23:53.1076950Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1079146Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1079618Z bogomips : 5600.00 2025-05-07T20:23:53.1079823Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1080044Z clflush size : 64 2025-05-07T20:23:53.1080239Z cache_alignment : 64 2025-05-07T20:23:53.1080500Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1080806Z power management: 2025-05-07T20:23:53.1080931Z 2025-05-07T20:23:53.1081005Z processor : 14 2025-05-07T20:23:53.1081216Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1081440Z cpu family : 23 2025-05-07T20:23:53.1081624Z model : 49 2025-05-07T20:23:53.1081820Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1082050Z stepping : 0 2025-05-07T20:23:53.1082239Z microcode : 0x830107f 2025-05-07T20:23:53.1082455Z cpu MHz : 3290.824 2025-05-07T20:23:53.1082664Z cache size : 512 KB 2025-05-07T20:23:53.1082859Z physical id : 0 2025-05-07T20:23:53.1083058Z siblings : 16 2025-05-07T20:23:53.1083254Z core id : 6 2025-05-07T20:23:53.1083430Z cpu cores : 8 2025-05-07T20:23:53.1083617Z apicid : 13 2025-05-07T20:23:53.1083812Z initial apicid : 13 2025-05-07T20:23:53.1084005Z fpu : yes 2025-05-07T20:23:53.1084195Z fpu_exception : yes 2025-05-07T20:23:53.1084458Z cpuid level : 13 2025-05-07T20:23:53.1084649Z wp : yes 2025-05-07T20:23:53.1086566Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1088747Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1089227Z bogomips : 5600.00 2025-05-07T20:23:53.1089440Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1089659Z clflush size : 64 2025-05-07T20:23:53.1089866Z cache_alignment : 64 2025-05-07T20:23:53.1090134Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1090433Z power management: 2025-05-07T20:23:53.1090570Z 2025-05-07T20:23:53.1090748Z processor : 15 2025-05-07T20:23:53.1090972Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1091206Z cpu family : 23 2025-05-07T20:23:53.1091419Z model : 49 2025-05-07T20:23:53.1091632Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1091863Z stepping : 0 2025-05-07T20:23:53.1092075Z microcode : 0x830107f 2025-05-07T20:23:53.1092306Z cpu MHz : 3138.690 2025-05-07T20:23:53.1092507Z cache size : 512 KB 2025-05-07T20:23:53.1092723Z physical id : 0 2025-05-07T20:23:53.1092932Z siblings : 16 2025-05-07T20:23:53.1093135Z core id : 7 2025-05-07T20:23:53.1093326Z cpu cores : 8 2025-05-07T20:23:53.1093612Z apicid : 15 2025-05-07T20:23:53.1093829Z initial apicid : 15 2025-05-07T20:23:53.1094061Z fpu : yes 2025-05-07T20:23:53.1094260Z fpu_exception : yes 2025-05-07T20:23:53.1094476Z cpuid level : 13 2025-05-07T20:23:53.1094671Z wp : yes 2025-05-07T20:23:53.1096598Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1098786Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1099272Z bogomips : 5600.00 2025-05-07T20:23:53.1099480Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1099711Z clflush size : 64 2025-05-07T20:23:53.1099924Z cache_alignment : 64 2025-05-07T20:23:53.1100188Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1100502Z power management: 2025-05-07T20:23:53.1100642Z 2025-05-07T20:23:53.1100647Z 2025-05-07T20:23:53.1100766Z ################################################################################ 2025-05-07T20:23:53.1101085Z [INFO] Print PCI info ... 2025-05-07T20:23:53.1101323Z + lspci -v 2025-05-07T20:23:53.1101454Z 2025-05-07T20:23:53.1101680Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:53.1102068Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:53.1102395Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.1102600Z 2025-05-07T20:23:53.1102791Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.1103173Z Physical Slot: 1 2025-05-07T20:23:53.1103424Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1103624Z 2025-05-07T20:23:53.1103878Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.1104300Z Physical Slot: 1 2025-05-07T20:23:53.1104560Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.1104780Z 2025-05-07T20:23:53.1105059Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.1105496Z Physical Slot: 3 2025-05-07T20:23:53.1105736Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1113005Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1113382Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.1113614Z 2025-05-07T20:23:53.1113914Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1114416Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.1114712Z Physical Slot: 4 2025-05-07T20:23:53.1114963Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.1115334Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1115694Z Capabilities: 2025-05-07T20:23:53.1115993Z Kernel driver in use: nvme 2025-05-07T20:23:53.1116172Z 2025-05-07T20:23:53.1116609Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1117090Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1117442Z Physical Slot: 5 2025-05-07T20:23:53.1117677Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1118029Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1118410Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1118724Z Capabilities: 2025-05-07T20:23:53.1118986Z Kernel driver in use: ena 2025-05-07T20:23:53.1119227Z Kernel modules: ena 2025-05-07T20:23:53.1119562Z 2025-05-07T20:23:53.1119794Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.1120254Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.1120548Z Physical Slot: 30 2025-05-07T20:23:53.1120796Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.1121174Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.1121592Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.1121963Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.1122288Z Capabilities: 2025-05-07T20:23:53.1122539Z Kernel driver in use: nvidia 2025-05-07T20:23:53.1122792Z Kernel modules: nvidia 2025-05-07T20:23:53.1122933Z 2025-05-07T20:23:53.1123235Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1123743Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.1124071Z Physical Slot: 31 2025-05-07T20:23:53.1124393Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1124742Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1125116Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.1125429Z Capabilities: 2025-05-07T20:23:53.1125694Z Kernel driver in use: nvme 2025-05-07T20:23:53.1125850Z 2025-05-07T20:23:53.1125854Z 2025-05-07T20:23:53.1125976Z ################################################################################ 2025-05-07T20:23:53.1126291Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.1126578Z + uname -a 2025-05-07T20:23:53.1126690Z 2025-05-07T20:23:53.1127104Z Linux ip-10-0-58-159.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.1127595Z 2025-05-07T20:23:53.1127681Z + uname -m 2025-05-07T20:23:53.1127791Z 2025-05-07T20:23:53.1127866Z x86_64 2025-05-07T20:23:53.1127982Z 2025-05-07T20:23:53.1128076Z + cat /proc/version 2025-05-07T20:23:53.1128207Z 2025-05-07T20:23:53.1128750Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.1129366Z 2025-05-07T20:23:53.1129461Z + cat /etc/os-release 2025-05-07T20:23:53.1129600Z 2025-05-07T20:23:53.1129688Z NAME="Amazon Linux" 2025-05-07T20:23:53.1129916Z VERSION="2023" 2025-05-07T20:23:53.1130122Z ID="amzn" 2025-05-07T20:23:53.1130304Z ID_LIKE="fedora" 2025-05-07T20:23:53.1130515Z VERSION_ID="2023" 2025-05-07T20:23:53.1130737Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.1131011Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.1131292Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.1131538Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.1131919Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.1132347Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.1132759Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.1133205Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.1133564Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.1133797Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.1134080Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.1134233Z 2025-05-07T20:23:53.1134458Z ################################################################################ 2025-05-07T20:23:53.1134762Z # Print EC2 Instance Info 2025-05-07T20:23:53.1134998Z # 2025-05-07T20:23:53.1135195Z # [2025-05-07T20:23:53.107Z] + print_ec2_info 2025-05-07T20:23:53.1135503Z ################################################################################ 2025-05-07T20:23:53.1135725Z 2025-05-07T20:23:53.1207555Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.1329375Z instance-id: i-00cc0d8f8d78d1eb8 2025-05-07T20:23:53.1446146Z instance-type: g5.4xlarge 2025-05-07T20:23:53.1485139Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.1485637Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.1495261Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.1495595Z env: 2025-05-07T20:23:53.1495797Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.1496085Z BUILD_ENV: build_binary 2025-05-07T20:23:53.1496319Z BUILD_TARGET: genai 2025-05-07T20:23:53.1496526Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.1496750Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:53.1496992Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.1497273Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.1497593Z ##[endgroup] 2025-05-07T20:23:53.4888136Z ################################################################################ 2025-05-07T20:23:53.4888535Z [INFO] Printing general display info ... 2025-05-07T20:23:53.4917561Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:53.5999949Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:53.6009558Z /usr/bin/sudo 2025-05-07T20:23:53.6020687Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:53.6030906Z /usr/bin/yum 2025-05-07T20:23:53.6032650Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:53.6053328Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.0679006Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:54.1355818Z ================================================================================ 2025-05-07T20:23:54.1356174Z WARNING: 2025-05-07T20:23:54.1356411Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.1356640Z 2025-05-07T20:23:54.1356728Z Available Versions: 2025-05-07T20:23:54.1356881Z 2025-05-07T20:23:54.1356967Z Version 2023.7.20250331: 2025-05-07T20:23:54.1357268Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.1357511Z 2025-05-07T20:23:54.1357672Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.1357900Z 2025-05-07T20:23:54.1357980Z Release notes: 2025-05-07T20:23:54.1358380Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.1358741Z 2025-05-07T20:23:54.1358826Z Version 2023.7.20250414: 2025-05-07T20:23:54.1359130Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.1359379Z 2025-05-07T20:23:54.1359491Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.1359693Z 2025-05-07T20:23:54.1359780Z Release notes: 2025-05-07T20:23:54.1360154Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.1360516Z 2025-05-07T20:23:54.1360598Z Version 2023.7.20250428: 2025-05-07T20:23:54.1360898Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.1361136Z 2025-05-07T20:23:54.1361262Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.1361468Z 2025-05-07T20:23:54.1361555Z Release notes: 2025-05-07T20:23:54.1361940Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.1362296Z 2025-05-07T20:23:54.1362417Z ================================================================================ 2025-05-07T20:23:54.2521361Z Dependencies resolved. 
2025-05-07T20:23:54.2810017Z ================================================================================ 2025-05-07T20:23:54.2810504Z Package Arch Version Repository Size 2025-05-07T20:23:54.2810896Z ================================================================================ 2025-05-07T20:23:54.2811195Z Upgrading: 2025-05-07T20:23:54.2811555Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:54.2812151Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:54.2812504Z 2025-05-07T20:23:54.2812999Z Transaction Summary 2025-05-07T20:23:54.2813426Z ================================================================================ 2025-05-07T20:23:54.2813735Z Upgrade 2 Packages 2025-05-07T20:23:54.2813870Z 2025-05-07T20:23:54.2813982Z Total download size: 6.9 M 2025-05-07T20:23:54.2814613Z Downloading Packages: 2025-05-07T20:23:54.3170874Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 36 MB/s | 1.2 MB 00:00 2025-05-07T20:23:54.3988755Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 49 MB/s | 5.7 MB 00:00 2025-05-07T20:23:54.4001225Z -------------------------------------------------------------------------------- 2025-05-07T20:23:54.4004135Z Total 58 MB/s | 6.9 MB 00:00 2025-05-07T20:23:54.4006506Z Running transaction check 2025-05-07T20:23:54.4100924Z Transaction check succeeded. 2025-05-07T20:23:54.4101867Z Running transaction test 2025-05-07T20:23:54.4397466Z Transaction test succeeded. 2025-05-07T20:23:54.4399826Z Running transaction 2025-05-07T20:23:54.9927642Z Preparing : 1/1 2025-05-07T20:23:55.0981124Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.1002328Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1222331Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1223050Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1329660Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1354582Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.2860139Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:55.2860718Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.2861272Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:55.2861804Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:55.4717874Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.4718232Z 2025-05-07T20:23:55.4718318Z Upgraded: 2025-05-07T20:23:55.4718658Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:55.4719201Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:55.4719540Z 2025-05-07T20:23:55.4719618Z Complete! 2025-05-07T20:23:55.5172982Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:55.5197117Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:55.9532171Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:55.9773239Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:56.0173915Z Dependencies resolved.
2025-05-07T20:23:56.0352282Z ================================================================================ 2025-05-07T20:23:56.0352743Z Package Architecture Version Repository Size 2025-05-07T20:23:56.0353180Z ================================================================================ 2025-05-07T20:23:56.0353475Z Installing: 2025-05-07T20:23:56.0353760Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:56.0354021Z 2025-05-07T20:23:56.0354110Z Transaction Summary 2025-05-07T20:23:56.0354352Z ================================================================================ 2025-05-07T20:23:56.0354649Z Install 1 Package 2025-05-07T20:23:56.0354779Z 2025-05-07T20:23:56.0355268Z Total download size: 319 k 2025-05-07T20:23:56.0355854Z Installed size: 837 k 2025-05-07T20:23:56.0357486Z Downloading Packages: 2025-05-07T20:23:56.1109726Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.0 MB/s | 319 kB 00:00 2025-05-07T20:23:56.1116664Z -------------------------------------------------------------------------------- 2025-05-07T20:23:56.1119423Z Total 4.1 MB/s | 319 kB 00:00 2025-05-07T20:23:56.1280855Z Running transaction check 2025-05-07T20:23:56.1338189Z Transaction check succeeded. 2025-05-07T20:23:56.1338711Z Running transaction test 2025-05-07T20:23:56.1800669Z Transaction test succeeded. 2025-05-07T20:23:56.1804577Z Running transaction 2025-05-07T20:23:56.2835589Z Preparing : 1/1 2025-05-07T20:23:56.3343371Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.4968713Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:56.6586727Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.6587267Z 2025-05-07T20:23:56.6587351Z Installed: 2025-05-07T20:23:56.6587668Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:56.6587954Z 2025-05-07T20:23:56.6588044Z Complete! 2025-05-07T20:23:56.7063421Z + hostname 2025-05-07T20:23:56.7063564Z 2025-05-07T20:23:56.7078548Z ip-10-0-58-159.ec2.internal 2025-05-07T20:23:56.7080136Z 2025-05-07T20:23:56.7080566Z + sudo lshw -C display 2025-05-07T20:23:56.7080724Z 2025-05-07T20:23:57.2736292Z *-display:0 UNCLAIMED 2025-05-07T20:23:57.2736654Z description: VGA compatible controller 2025-05-07T20:23:57.2736983Z product: Amazon.com, Inc. 2025-05-07T20:23:57.2737254Z vendor: Amazon.com, Inc.
2025-05-07T20:23:57.2737506Z physical id: 3 2025-05-07T20:23:57.2737743Z bus info: pci@0000:00:03.0 2025-05-07T20:23:57.2738003Z version: 00 2025-05-07T20:23:57.2738211Z width: 32 bits 2025-05-07T20:23:57.2738431Z clock: 33MHz 2025-05-07T20:23:57.2738672Z capabilities: vga_controller bus_master 2025-05-07T20:23:57.2738984Z configuration: latency=0 2025-05-07T20:23:57.2739316Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:57.2739653Z *-display:1 2025-05-07T20:23:57.2739878Z description: 3D controller 2025-05-07T20:23:57.2740190Z product: GA102GL [A10G] 2025-05-07T20:23:57.2740456Z vendor: NVIDIA Corporation 2025-05-07T20:23:57.2740735Z physical id: 1e 2025-05-07T20:23:57.2740979Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:57.2741229Z version: a1 2025-05-07T20:23:57.2741443Z width: 64 bits 2025-05-07T20:23:57.2741665Z clock: 33MHz 2025-05-07T20:23:57.2741960Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:57.2742324Z configuration: driver=nvidia latency=0 2025-05-07T20:23:57.2742954Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:57.2779075Z 2025-05-07T20:23:57.2779521Z ################################################################################ 2025-05-07T20:23:57.2788586Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:57.2910986Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:57.3093077Z Wed May 7 20:23:57 2025 2025-05-07T20:23:57.3093821Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.3094386Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:57.3094862Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:57.3095356Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:57.3095876Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:57.3096294Z | | | MIG M. | 2025-05-07T20:23:57.3096628Z |=========================================+========================+======================| 2025-05-07T20:23:57.3172502Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:57.3173534Z | 0% 31C P0 58W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:57.3174219Z | | | N/A | 2025-05-07T20:23:57.3174764Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:57.3175251Z 2025-05-07T20:23:57.3175641Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.3176081Z | Processes: | 2025-05-07T20:23:57.3176546Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:57.3176950Z | ID ID Usage | 2025-05-07T20:23:57.3177308Z |=========================================================================================| 2025-05-07T20:23:57.3178057Z | No running processes found | 2025-05-07T20:23:57.3178854Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:57.4622193Z ################################################################################ 2025-05-07T20:23:57.4766640Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:57.4767387Z [INFO] Printing AMD GPU info ... 
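The AMD GPU check that follows is a plain PATH probe: on this CUDA runner neither ROCm tool exists, so the step reports both as missing rather than failing. A minimal sketch of such a probe (the loop structure is an assumption; only the tool names and the `[CHECK]` message format are taken from the log):

# Probe for ROCm tooling; run each tool if present, otherwise report it missing.
for tool in rocminfo rocm-smi; do
  if which "$tool"; then
    "$tool"                       # print the ROCm report when available
  else
    echo "[CHECK] $tool not found"
  fi
done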
2025-05-07T20:23:57.4767900Z [CHECK] rocminfo not found 2025-05-07T20:23:57.4776595Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:57.4777429Z [CHECK] rocm-smi not found 2025-05-07T20:23:57.4812740Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:57.4813164Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:57.4826066Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:57.4826415Z env: 2025-05-07T20:23:57.4826632Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:57.4826947Z BUILD_ENV: build_binary 2025-05-07T20:23:57.4827197Z BUILD_TARGET: genai 2025-05-07T20:23:57.4827430Z BUILD_VARIANT: cuda 2025-05-07T20:23:57.4827658Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:57.4827921Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:57.4828229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:57.4828563Z ##[endgroup] 2025-05-07T20:23:57.8209344Z ################################################################################ 2025-05-07T20:23:57.8210608Z # Setup Miniconda 2025-05-07T20:23:57.8211183Z # 2025-05-07T20:23:57.8225411Z # [2025-05-07T20:23:57.822Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:57.8226078Z ################################################################################ 2025-05-07T20:23:57.8226458Z 2025-05-07T20:23:57.8242502Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:57.9122628Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:57.9123221Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:57.9123533Z 2025-05-07T20:23:57.9140119Z 2025-05-07T20:23:57.9140495Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:57.9162780Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:58.9066862Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:58.9067573Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:58.9068087Z 2025-05-07T20:23:58.9216981Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:59.3696451Z Unpacking payload ... 2025-05-07T20:23:59.8913282Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:00.7061761Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:02.8277465Z 2025-05-07T20:24:02.8278221Z Installing base environment... 2025-05-07T20:24:02.8278840Z 2025-05-07T20:24:03.9112699Z Preparing transaction: ...working... done 2025-05-07T20:24:06.9616162Z Executing transaction: ...working... done 2025-05-07T20:24:07.6421858Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:24:07.7444401Z installation finished. 2025-05-07T20:24:07.7452949Z 2025-05-07T20:24:07.7453232Z + rm -f miniconda.sh 2025-05-07T20:24:07.7453404Z 2025-05-07T20:24:07.8396852Z 2025-05-07T20:24:07.8397174Z [SETUP] Reloading the bash configuration ... 
2025-05-07T20:24:07.8397516Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:08.2102348Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:08.2103104Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:08.2103785Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:08.2104461Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:08.2105153Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:08.2105907Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:08.2106736Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:08.2107567Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:08.2108832Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:08.2110131Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:08.2110636Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:08.2110992Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:08.2111378Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:08.2871771Z + . /home/ec2-user/.bashrc
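[NOTE] The Miniconda bootstrap above (quiet download, batch install, conda init, re-source ~/.bashrc) can be reproduced stand-alone. A minimal bash sketch of the same sequence; this is not the actual setup_miniconda from .github/scripts/setup_env.bash, only the commands the log echoes, collected:

    #!/usr/bin/env bash
    set -euo pipefail
    prefix="$HOME/miniconda"                 # install prefix used in this job
    mkdir -p "$prefix"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$prefix" -u     # -b: batch (no prompts), -u: update an existing prefix
    rm -f miniconda.sh
    "$prefix/bin/conda" init bash            # appends the conda shell hook to ~/.bashrc
    . "$HOME/.bashrc"                        # reload so `conda` resolves in this shell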
2025-05-07T20:24:09.1469208Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:09.1494416Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:23.0614692Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:24.7331853Z Solving environment: done
2025-05-07T20:24:24.8314680Z ## Package Plan ##
2025-05-07T20:24:24.8315143Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:24.8315525Z added / updated specs:
2025-05-07T20:24:24.8315792Z - conda-libmamba-solver
2025-05-07T20:24:24.8316044Z - libarchive
2025-05-07T20:24:24.8316239Z - libmamba
2025-05-07T20:24:24.8316439Z - libmambapy
2025-05-07T20:24:24.8316717Z The following packages will be downloaded:
2025-05-07T20:24:24.8317046Z package | build
2025-05-07T20:24:24.8317355Z ---------------------------|-----------------
2025-05-07T20:24:24.8317759Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:24.8318467Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:24.8318883Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:24.8319354Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:24.8319800Z ------------------------------------------------------------
2025-05-07T20:24:24.8320138Z Total: 1.4 MB
2025-05-07T20:24:24.8320447Z The following packages will be UPDATED:
2025-05-07T20:24:24.8324587Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:24.8325361Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:24.8325950Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:24.8326576Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:24.8327361Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:24.8327993Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all four packages reached 100%]
2025-05-07T20:24:25.1249643Z Preparing transaction: done
2025-05-07T20:24:25.2255009Z Verifying transaction: done
2025-05-07T20:24:26.6288481Z Executing transaction: done
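[NOTE] The [EXEC] [ATTEMPT 0/3] prefixes throughout this job come from a retry helper in setup_env.bash. Its body is not shown in this log, so the following is only a guessed-at sketch of such a wrapper (the name exec_with_retries and the backoff policy are hypothetical):

    exec_with_retries () {
      local max=3
      for ((i = 0; i < max; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max}] + $*"
        "$@" && return 0          # success: stop retrying
        sleep $((2 ** i))         # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }
    # usage: exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null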
2025-05-07T20:24:28.6271974Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:28.6298001Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:29.5900787Z Channels:
2025-05-07T20:24:29.5901175Z - defaults
2025-05-07T20:24:29.5901670Z Platform: linux-64
2025-05-07T20:24:30.8623261Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.9803735Z Solving environment: Channels:
2025-05-07T20:24:30.9804066Z - defaults
2025-05-07T20:24:30.9804265Z Platform: linux-64
2025-05-07T20:24:31.2802340Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:31.4958467Z Solving environment: done
2025-05-07T20:24:31.5747427Z done
2025-05-07T20:24:31.6427202Z ## Package Plan ##
2025-05-07T20:24:31.6427528Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:31.6427870Z added / updated specs:
2025-05-07T20:24:31.6428102Z - conda
2025-05-07T20:24:31.6428338Z The following packages will be downloaded:
2025-05-07T20:24:31.6428666Z package | build
2025-05-07T20:24:31.6428975Z ---------------------------|-----------------
2025-05-07T20:24:31.6429787Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:31.6430174Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:31.6430532Z ------------------------------------------------------------
2025-05-07T20:24:31.6430852Z Total: 1.4 MB
2025-05-07T20:24:31.6431171Z The following packages will be UPDATED:
2025-05-07T20:24:31.6431675Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:31.6432171Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:31.6432557Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; both packages reached 100%]
2025-05-07T20:24:32.0074584Z Preparing transaction: done
2025-05-07T20:24:32.1078101Z Verifying transaction: done
2025-05-07T20:24:34.2212267Z Executing transaction: done
2025-05-07T20:24:34.8676019Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:34.8681044Z + conda clean --packages --tarball -y
2025-05-07T20:24:35.8954246Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:35.8954869Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:35.9727252Z + conda clean --all -y
2025-05-07T20:24:36.5190402Z There are no unused tarball(s) to remove.
2025-05-07T20:24:36.5190828Z Will remove 1 index cache(s).
2025-05-07T20:24:36.5191282Z There are no unused package(s) to remove.
2025-05-07T20:24:36.5191583Z There are no tempfile(s) to remove. 2025-05-07T20:24:36.5191875Z There are no logfile(s) to remove. 2025-05-07T20:24:36.5888809Z 2025-05-07T20:24:36.5893802Z + conda info 2025-05-07T20:24:36.5893937Z 2025-05-07T20:24:37.3860090Z 2025-05-07T20:24:37.3860888Z active environment : base 2025-05-07T20:24:37.3861282Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:37.3861600Z shell level : 1 2025-05-07T20:24:37.3861889Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:37.3862270Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:37.3862622Z conda version : 25.3.1 2025-05-07T20:24:37.3862896Z conda-build version : not installed 2025-05-07T20:24:37.3863181Z python version : 3.13.2.final.0 2025-05-07T20:24:37.3863468Z solver : libmamba (default) 2025-05-07T20:24:37.3863753Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:37.3864044Z __conda=25.3.1=0 2025-05-07T20:24:37.3864312Z __cuda=12.8=0 2025-05-07T20:24:37.3864565Z __glibc=2.34=0 2025-05-07T20:24:37.3864827Z __linux=6.1.130=0 2025-05-07T20:24:37.3865094Z __unix=0=0 2025-05-07T20:24:37.3865880Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:37.3866277Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:37.3866616Z conda av metadata url : None 2025-05-07T20:24:37.3866978Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:37.3867384Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:37.3867765Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:37.3868127Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:37.3868473Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:37.3868805Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:37.3869134Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:37.3869464Z /home/ec2-user/.conda/envs 2025-05-07T20:24:37.3869755Z platform : linux-64 2025-05-07T20:24:37.3870622Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:37.3871427Z UID:GID : 1000:1000 2025-05-07T20:24:37.3871682Z netrc file : None 2025-05-07T20:24:37.3871937Z offline mode : False 2025-05-07T20:24:37.3872103Z 2025-05-07T20:24:37.4607071Z 2025-05-07T20:24:37.4610728Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:37.4611965Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_10a99a44-67a7-4380-9add-068cd6ab572a ... 2025-05-07T20:24:37.4613271Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:37.4693152Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:37.4702932Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:37.4720559Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:37.4720901Z env: 2025-05-07T20:24:37.4721116Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:37.4721401Z BUILD_ENV: build_binary 2025-05-07T20:24:37.4721636Z BUILD_TARGET: genai 2025-05-07T20:24:37.4721858Z BUILD_VARIANT: cuda 2025-05-07T20:24:37.4722255Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:37.4722498Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:37.4722794Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:37.4723127Z ##[endgroup] 2025-05-07T20:24:37.8120107Z ################################################################################ 2025-05-07T20:24:37.8120627Z # Create Conda Environment 2025-05-07T20:24:37.8120879Z # 2025-05-07T20:24:37.8138595Z # [2025-05-07T20:24:37.813Z] + create_conda_environment build_binary 3.13 2025-05-07T20:24:37.8139156Z ################################################################################ 2025-05-07T20:24:37.8139462Z 2025-05-07T20:24:37.8154258Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:37.9047474Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:37.9047879Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:37.9048205Z + conda info --envs 2025-05-07T20:24:37.9048354Z 2025-05-07T20:24:38.6894764Z 2025-05-07T20:24:38.6895576Z # conda environments: 2025-05-07T20:24:38.6895971Z # 2025-05-07T20:24:38.6896265Z base /home/ec2-user/miniconda 2025-05-07T20:24:38.6896497Z 2025-05-07T20:24:38.7630959Z 2025-05-07T20:24:38.7631535Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:40.4561297Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:40.4561581Z 2025-05-07T20:24:40.4575031Z 2025-05-07T20:24:40.4584778Z [SETUP] Creating new Conda environment (Python 3.13) ... 
2025-05-07T20:24:40.4608867Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:41.2446273Z Channels:
2025-05-07T20:24:41.2446654Z - defaults
2025-05-07T20:24:41.2446948Z Platform: linux-64
2025-05-07T20:24:42.6959842Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:42.8207318Z Solving environment: done
2025-05-07T20:24:42.8501026Z ## Package Plan ##
2025-05-07T20:24:42.8501680Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:42.8502316Z added / updated specs:
2025-05-07T20:24:42.8502567Z - python=3.13
2025-05-07T20:24:42.8502821Z The following packages will be downloaded:
2025-05-07T20:24:42.8503196Z package | build
2025-05-07T20:24:42.8503540Z ---------------------------|-----------------
2025-05-07T20:24:42.8503963Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:42.8504634Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:42.8505279Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:42.8505680Z python_abi-3.13 | 0_cp313 6 KB
2025-05-07T20:24:42.8506028Z ------------------------------------------------------------
2025-05-07T20:24:42.8506360Z Total: 159 KB
2025-05-07T20:24:42.8506681Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:42.8507092Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:42.8507505Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:42.8508654Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:42.8509438Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:42.8509941Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:42.8510369Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:42.8510815Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:42.8511225Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.8511790Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:42.8512211Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:42.8512666Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:42.8513108Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:42.8513754Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:42.8514397Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:42.8514892Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:42.8515298Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:42.8515723Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:42.8516142Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:42.8516594Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:42.8517044Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:42.8517415Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:42.8517787Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:42.8518190Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:42.8518566Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:42.8519033Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:42.8519571Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all packages reached 100%]
2025-05-07T20:24:43.1586630Z Preparing transaction: done
2025-05-07T20:24:44.6138364Z Verifying transaction: done
2025-05-07T20:24:47.0333838Z Executing transaction: done
2025-05-07T20:24:47.0838207Z #
2025-05-07T20:24:47.0838835Z # To activate this environment, use
2025-05-07T20:24:47.0839593Z #
2025-05-07T20:24:47.0840128Z # $ conda activate build_binary
2025-05-07T20:24:47.0840840Z #
2025-05-07T20:24:47.0841308Z # To deactivate an active environment, use
2025-05-07T20:24:47.0841875Z #
2025-05-07T20:24:47.0842221Z # $ conda deactivate
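[NOTE] Condensed, the environment bootstrap above is three idempotent commands. A sketch with the names and paths from this log:

    rm -rf "$HOME/miniconda/envs/build_binary"   # drop any stale prefix first
    conda create -y -n build_binary python=3.13  # fresh env with the requested Python
    conda run -n build_binary python --version   # expect: Python 3.13.x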
2025-05-07T20:24:47.1999317Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.2021905Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.2307768Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:50.2308854Z Collecting pip
2025-05-07T20:24:50.2309197Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.2309693Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.2313903Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 58.6 MB/s eta 0:00:00
2025-05-07T20:24:50.2314306Z Installing collected packages: pip
2025-05-07T20:24:50.2314622Z Attempting uninstall: pip
2025-05-07T20:24:50.2314911Z Found existing installation: pip 25.1
2025-05-07T20:24:50.2315236Z Uninstalling pip-25.1:
2025-05-07T20:24:50.2315530Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.2315839Z Successfully installed pip-25.1.1
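[NOTE] conda run -n <env> executes a single command inside the named environment without activating it in the calling shell, which is why this job can stay in plain, non-interactive bash throughout. The same pattern works for any per-environment tool, e.g.:

    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c "import sys; print(sys.version)"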
2025-05-07T20:24:50.3053975Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:50.3077312Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.1779347Z Channels:
2025-05-07T20:24:51.1779712Z - conda-forge
2025-05-07T20:24:51.1779936Z Platform: linux-64
2025-05-07T20:25:02.0006145Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.7429472Z Solving environment: done
2025-05-07T20:25:03.8067111Z ## Package Plan ##
2025-05-07T20:25:03.8067760Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.8068283Z added / updated specs:
2025-05-07T20:25:03.8068586Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.8068900Z The following packages will be downloaded:
2025-05-07T20:25:03.8069230Z package | build
2025-05-07T20:25:03.8069538Z ---------------------------|-----------------
2025-05-07T20:25:03.8069897Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge
2025-05-07T20:25:03.8070343Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge
2025-05-07T20:25:03.8071077Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:03.8071829Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:03.8072438Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:03.8072834Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:03.8073691Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:03.8074124Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:03.8074572Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:03.8075040Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:03.8075444Z ------------------------------------------------------------
2025-05-07T20:25:03.8075775Z Total: 6.4 MB
2025-05-07T20:25:03.8076273Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.8076682Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:25:03.8077165Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:25:03.8077649Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.8080195Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.8080740Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.8081259Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.8081834Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.8082274Z The following packages will be UPDATED:
2025-05-07T20:25:03.8082858Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.8083602Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.8084232Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.8085069Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.8085827Z Downloading and Extracting Packages: ...working... done [per-package progress bars trimmed; all ten packages reached 100%]
2025-05-07T20:25:04.4204025Z Preparing transaction: done
2025-05-07T20:25:04.5206941Z Verifying transaction: done
2025-05-07T20:25:06.0232689Z Executing transaction: done
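[NOTE] A shell-quoting detail in the step above: written literally, pyOpenSSL>22.1.0 would be parsed by bash as a redirection to a file named 22.1.0, so inside a script the version spec has to be quoted (the echoed command line simply does not show the quotes). A sketch of the same install plus the import check that follows:

    conda install -n build_binary -c conda-forge --override-channels -y "pyopenssl>22.1.0"
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"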
2025-05-07T20:25:06.2137060Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:08.0013825Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:08.0027360Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:08.0051092Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:08.8834074Z Channels:
2025-05-07T20:25:08.8834335Z - conda-forge
2025-05-07T20:25:08.8834555Z Platform: linux-64
2025-05-07T20:25:12.4560657Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:12.8354767Z Solving environment: done
2025-05-07T20:25:12.8976720Z ## Package Plan ##
2025-05-07T20:25:12.8977581Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:12.8978017Z added / updated specs:
2025-05-07T20:25:12.8978263Z - libxcrypt
2025-05-07T20:25:12.8978515Z The following packages will be downloaded:
2025-05-07T20:25:12.8978834Z package | build
2025-05-07T20:25:12.8979144Z ---------------------------|-----------------
2025-05-07T20:25:12.8979699Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:12.8980087Z ------------------------------------------------------------
2025-05-07T20:25:12.8980419Z Total: 98 KB
2025-05-07T20:25:12.8980752Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:12.8981201Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:12.8981635Z Downloading and Extracting Packages: ...working... done [progress bars trimmed; libxcrypt reached 100%]
2025-05-07T20:25:13.1881857Z Preparing transaction: done
2025-05-07T20:25:13.2886838Z Verifying transaction: done
2025-05-07T20:25:13.3893880Z Executing transaction: done
2025-05-07T20:25:16.9368598Z [SETUP] Copying over crypt.h ...
2025-05-07T20:25:16.9370002Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:18.6290215Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:18.6290822Z [SETUP] Successfully created Conda environment: build_binary
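[NOTE] The libxcrypt step appears to exist because recent glibc and conda Python builds no longer ship crypt.h, while some native build steps still include it; installing conda-forge's libxcrypt and copying its header into the env's Python include directory keeps such builds compiling. A sketch of the same workaround:

    env_prefix="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # make <crypt.h> visible where builds add -I$env_prefix/include/python3.13
    cp "$env_prefix/include/crypt.h" "$env_prefix/include/python3.13/crypt.h"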
2025-05-07T20:25:18.6324705Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:18.6325248Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:18.6348279Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:18.6348631Z env:
2025-05-07T20:25:18.6348842Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:18.6349130Z BUILD_ENV: build_binary
2025-05-07T20:25:18.6349355Z BUILD_TARGET: genai
2025-05-07T20:25:18.6349573Z BUILD_VARIANT: cuda
2025-05-07T20:25:18.6349796Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:18.6350030Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:18.6350319Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:18.6350700Z ##[endgroup]
2025-05-07T20:25:18.9786448Z ################################################################################
2025-05-07T20:25:18.9786948Z # Install C/C++ Compilers
2025-05-07T20:25:18.9787202Z #
2025-05-07T20:25:18.9803801Z # [2025-05-07T20:25:18.979Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:18.9804454Z ################################################################################
2025-05-07T20:25:18.9823842Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:19.0722583Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:19.0733708Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:19.0757103Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:19.9534969Z Channels:
2025-05-07T20:25:19.9535205Z - conda-forge
2025-05-07T20:25:19.9535424Z Platform: linux-64
2025-05-07T20:25:23.5121274Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:23.8919011Z Solving environment: done
2025-05-07T20:25:23.9542617Z ## Package Plan ##
2025-05-07T20:25:23.9543049Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:23.9543486Z added / updated specs:
2025-05-07T20:25:23.9543765Z - sysroot_linux-64=2.17
2025-05-07T20:25:23.9544542Z The following packages will be downloaded:
2025-05-07T20:25:23.9544873Z package | build
2025-05-07T20:25:23.9545206Z ---------------------------|-----------------
2025-05-07T20:25:23.9545634Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:23.9546125Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:23.9546535Z ------------------------------------------------------------
2025-05-07T20:25:23.9546878Z Total: 15.4 MB
2025-05-07T20:25:23.9547227Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:23.9547752Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:23.9548312Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:23.9548786Z Downloading and Extracting Packages: ...working... done [progress bars trimmed; both packages reached 100%]
2025-05-07T20:25:25.0918178Z Preparing transaction: done
2025-05-07T20:25:25.2924632Z Verifying transaction: done
2025-05-07T20:25:25.4983526Z Executing transaction: done
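[NOTE] Pinning sysroot_linux-64=2.17 points conda-forge's GCC at glibc-2.17 headers and link stubs, so binaries built in this env remain loadable on older distributions (manylinux2014-era glibc); that is presumably why the pin is used here. To confirm the pin landed:

    conda list -n build_binary sysroot_linux-64   # expect version 2.17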
2025-05-07T20:25:25.6761089Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:27.4089738Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:27.4090421Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
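[NOTE] The check above probes the env prefix for libstdc++.so.6 so that later C++/CUDA builds resolve the conda toolchain's runtime rather than the system one. An equivalent stand-alone probe (a sketch, not the script's exact code):

    prefix="$HOME/miniconda/envs/build_binary"
    if [ -e "$prefix/lib/libstdc++.so.6" ]; then
      echo "[CHECK] libstdc++.so.6 found: $(readlink -f "$prefix/lib/libstdc++.so.6")"
    else
      echo "[CHECK] libstdc++.so.6 missing under $prefix" >&2
    fi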
2025-05-07T20:25:27.4102337Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:27.4127118Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:28.3124515Z Channels:
2025-05-07T20:25:28.3124763Z - conda-forge
2025-05-07T20:25:28.3124984Z Platform: linux-64
2025-05-07T20:25:31.8246505Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:32.8200180Z Solving environment: done
2025-05-07T20:25:32.8848321Z ## Package Plan ##
2025-05-07T20:25:32.8848755Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:32.8849372Z added / updated specs:
2025-05-07T20:25:32.8849633Z - gxx_linux-64=11.4.0
2025-05-07T20:25:32.8849922Z The following packages will be downloaded:
2025-05-07T20:25:32.8850327Z package | build
2025-05-07T20:25:32.8850823Z ---------------------------|-----------------
2025-05-07T20:25:32.8851586Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:32.8852944Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:32.8853792Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:32.8854242Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:32.8854691Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:32.8855344Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:32.8855784Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:32.8856264Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:32.8856746Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:32.8857187Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:32.8857669Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:32.8858168Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:32.8858580Z ------------------------------------------------------------
2025-05-07T20:25:32.8858920Z Total: 91.6 MB
2025-05-07T20:25:32.8859293Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:32.8859885Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:32.8860642Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:32.8861548Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:32.8862145Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:32.8862662Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:32.8863181Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:32.8863697Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8864255Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:32.8864798Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:32.8865356Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:32.8865815Z The following packages will be UPDATED:
2025-05-07T20:25:32.8866332Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:32.8867035Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:32.8867595Z Downloading and Extracting Packages: ...working...
[per-package download progress bars trimmed; the log excerpt ends mid-download at 2025-05-07T20:25:34.1350486Z, with the small packages at 100% and gcc_impl_linux-64 (53.0 MB) still downloading]
gcc_impl_linux-64-11 | 53.0 MB | ########2 | 83% 2025-05-07T20:25:34.1350883Z 2025-05-07T20:25:34.1350890Z 2025-05-07T20:25:34.1350896Z 2025-05-07T20:25:34.1350904Z 2025-05-07T20:25:34.1350911Z 2025-05-07T20:25:34.1350918Z 2025-05-07T20:25:34.1350924Z 2025-05-07T20:25:34.1350931Z 2025-05-07T20:25:34.1351241Z 2025-05-07T20:25:34.1351261Z 2025-05-07T20:25:34.1353287Z 2025-05-07T20:25:34.1363115Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:34.1363853Z 2025-05-07T20:25:34.1363865Z 2025-05-07T20:25:34.1363891Z 2025-05-07T20:25:34.1363901Z 2025-05-07T20:25:34.1363910Z 2025-05-07T20:25:34.1363919Z 2025-05-07T20:25:34.1363928Z 2025-05-07T20:25:34.1363936Z 2025-05-07T20:25:34.1363946Z 2025-05-07T20:25:34.1363954Z 2025-05-07T20:25:34.1363964Z 2025-05-07T20:25:34.1403315Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:34.1403862Z 2025-05-07T20:25:34.1403870Z 2025-05-07T20:25:34.1403877Z 2025-05-07T20:25:34.1403884Z 2025-05-07T20:25:34.1403891Z 2025-05-07T20:25:34.1403897Z 2025-05-07T20:25:34.1403904Z 2025-05-07T20:25:34.1403911Z 2025-05-07T20:25:34.1403918Z 2025-05-07T20:25:34.1407184Z 2025-05-07T20:25:34.1414733Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:34.1415160Z 2025-05-07T20:25:34.1415166Z 2025-05-07T20:25:34.1415172Z 2025-05-07T20:25:34.1415177Z 2025-05-07T20:25:34.1415192Z 2025-05-07T20:25:34.1415197Z 2025-05-07T20:25:34.1415203Z 2025-05-07T20:25:34.1415217Z 2025-05-07T20:25:34.1415222Z 2025-05-07T20:25:34.1415733Z 2025-05-07T20:25:34.2170026Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:34.3359337Z gcc_impl_linux-64-11 | 53.0 MB | #########2 | 92% 2025-05-07T20:25:34.3359848Z 2025-05-07T20:25:34.3359859Z 2025-05-07T20:25:34.3362428Z 2025-05-07T20:25:34.4916379Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:34.5186420Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:34.5187372Z 2025-05-07T20:25:34.8315015Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:34.8315363Z 2025-05-07T20:25:34.8315372Z 2025-05-07T20:25:35.2493221Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:35.2499152Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:35.2499711Z 2025-05-07T20:25:35.2500022Z 2025-05-07T20:25:35.2500307Z  2025-05-07T20:25:35.2501030Z 2025-05-07T20:25:35.2501036Z 2025-05-07T20:25:35.2501273Z  2025-05-07T20:25:35.2501581Z 2025-05-07T20:25:35.2501587Z 2025-05-07T20:25:35.2501592Z 2025-05-07T20:25:35.2501828Z  2025-05-07T20:25:35.2502123Z 2025-05-07T20:25:35.2502128Z 2025-05-07T20:25:35.2502143Z 2025-05-07T20:25:35.2502150Z 2025-05-07T20:25:35.2502406Z  2025-05-07T20:25:35.2502693Z 2025-05-07T20:25:35.2502704Z 2025-05-07T20:25:35.2502708Z 2025-05-07T20:25:35.2502712Z 2025-05-07T20:25:35.2502715Z 2025-05-07T20:25:35.2502904Z  2025-05-07T20:25:35.2503111Z 2025-05-07T20:25:35.2503115Z 2025-05-07T20:25:35.2503118Z 2025-05-07T20:25:35.2503131Z 2025-05-07T20:25:35.2503134Z 2025-05-07T20:25:35.2503146Z 2025-05-07T20:25:35.2503321Z  2025-05-07T20:25:35.2503534Z 2025-05-07T20:25:35.2503539Z 2025-05-07T20:25:35.2503543Z 2025-05-07T20:25:35.2503548Z 2025-05-07T20:25:35.2503562Z 2025-05-07T20:25:35.2503570Z 2025-05-07T20:25:35.2503575Z 2025-05-07T20:25:35.2503831Z  2025-05-07T20:25:35.2504157Z 2025-05-07T20:25:35.2504163Z 2025-05-07T20:25:35.2504168Z 2025-05-07T20:25:35.2504174Z 2025-05-07T20:25:35.2504188Z 2025-05-07T20:25:35.2504193Z 2025-05-07T20:25:35.2504199Z 2025-05-07T20:25:35.2504204Z 2025-05-07T20:25:35.2504677Z  
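[NOTE] The packages above are the conda-forge GCC 11.4 toolchain metapackages plus their implementation dependencies. A minimal sketch of the kind of command that produces this transaction (the actual invoking script runs earlier in the log and is not shown in this excerpt, so the exact package pins and flags here are an assumption):

  # Install the conda-packaged GCC/G++ 11.4 toolchain into the build env;
  # gcc_impl_linux-64, gxx_impl_linux-64, binutils_impl_linux-64, etc.
  # are pulled in automatically as dependencies of the metapackages.
  conda install -n build_binary -y gcc_linux-64=11.4.0 gxx_linux-64=11.4.0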
2025-05-07T20:25:35.2506135Z done
2025-05-07T20:25:35.3509300Z Preparing transaction: done
2025-05-07T20:25:35.6522621Z Verifying transaction: done
2025-05-07T20:25:35.7533146Z Executing transaction: done
2025-05-07T20:25:35.9364475Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:39.9732947Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:39.9763540Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:39.9794459Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:39.9826907Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:41.9158883Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.9840945Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:43.9222656Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:43.9875758Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:45.9274133Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:45.9950880Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:47.9354589Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:48.0064288Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:48.0070823Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:48.0071260Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:48.0071477Z 2025-05-07T20:25:49.9489253Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9489867Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:49.9490323Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:49.9490853Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9491448Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:49.9491980Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:49.9492379Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:49.9492811Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:49.9493179Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:49.9493522Z #define __CHAR_BIT__ 8 2025-05-07T20:25:49.9493838Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:49.9494188Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:49.9495042Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:49.9495442Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:49.9495842Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:49.9496272Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9496681Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:49.9497086Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:49.9497539Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:49.9497973Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:49.9498552Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:49.9499142Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:49.9499575Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:49.9499961Z #define __GCC_IEC_559 2 2025-05-07T20:25:49.9500297Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9500678Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:49.9501033Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9501450Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:49.9501918Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9502385Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:49.9502769Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9503146Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:49.9503494Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:49.9503855Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:49.9504207Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:49.9504541Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:49.9504885Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:49.9505227Z #define __INT8_C(c) c 2025-05-07T20:25:49.9505541Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:49.9505936Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9506377Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:49.9506802Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.9507289Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9507681Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:49.9508063Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9508704Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:49.9509073Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:49.9509620Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:49.9510457Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:49.9510855Z #define __linux 1 2025-05-07T20:25:49.9511167Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:49.9511548Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:49.9511946Z #define __unix 1 2025-05-07T20:25:49.9512267Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:49.9512661Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9513029Z #define __WINT_MIN__ 0U 2025-05-07T20:25:49.9513378Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9513789Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:49.9514163Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:49.9514538Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:49.9514884Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:49.9515266Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:49.9515679Z #define __INT64_C(c) c ## L 2025-05-07T20:25:49.9516067Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:49.9516506Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:49.9516881Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:49.9517380Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:49.9517904Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:49.9518263Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:49.9518619Z #define __DBL_DIG__ 15 2025-05-07T20:25:49.9518936Z #define __FLT32_DIG__ 6 2025-05-07T20:25:49.9519362Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:49.9519854Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:49.9520198Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:49.9520846Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:49.9521317Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:49.9521640Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:49.9521994Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:49.9522519Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:49.9523101Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:49.9523497Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:49.9523863Z #define __unix__ 1 2025-05-07T20:25:49.9524174Z #define __INT_WIDTH__ 32 2025-05-07T20:25:49.9524703Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:49.9525066Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:49.9536684Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:49.9537065Z #define __UINT16_C(c) c 2025-05-07T20:25:49.9537401Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:49.9537749Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:49.9538252Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:49.9538764Z #define __gnu_linux__ 1 2025-05-07T20:25:49.9539103Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:49.9539493Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9539868Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9540232Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:49.9540602Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:49.9540932Z #define __GNUC__ 11 2025-05-07T20:25:49.9541211Z #define __pie__ 2 2025-05-07T20:25:49.9541494Z #define __MMX__ 1 2025-05-07T20:25:49.9541789Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:49.9542162Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:49.9542559Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:49.9542949Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:49.9543473Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:49.9544063Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9544496Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:49.9544856Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:49.9545240Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:49.9545644Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:49.9546023Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:49.9546379Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:49.9546748Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:49.9547362Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:49.9547744Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:49.9548113Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:49.9548462Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:49.9548830Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:49.9549205Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:49.9549562Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:49.9549909Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:49.9550332Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:49.9550845Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:49.9551229Z #define __SSE2_MATH__ 1 2025-05-07T20:25:49.9551563Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:49.9552024Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9552439Z #define __amd64 1 2025-05-07T20:25:49.9552738Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:49.9553110Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:49.9553541Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:49.9553986Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:49.9554326Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9554682Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:49.9555033Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:49.9555382Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:49.9555740Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:49.9556093Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:49.9556447Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:49.9556822Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:49.9557143Z #define __x86_64 1 2025-05-07T20:25:49.9557449Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:49.9558079Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:49.9558741Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:49.9559383Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:49.9560036Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:49.9560575Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:49.9560920Z #define __LP64__ 1 2025-05-07T20:25:49.9561228Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9561727Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:49.9562269Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:49.9562645Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9563009Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9563397Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:49.9563778Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:49.9564153Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:49.9564674Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:49.9565046Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:49.9565391Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:49.9565869Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:49.9566400Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:49.9566766Z #define __FLT_DIG__ 6 2025-05-07T20:25:49.9567151Z #define __NO_INLINE__ 1 2025-05-07T20:25:49.9567463Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:49.9567925Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:49.9568415Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:49.9568766Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:49.9569141Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:49.9569499Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:49.9569843Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:49.9570205Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:49.9570623Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:49.9571025Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:49.9571371Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:49.9571782Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:49.9572231Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:49.9572712Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:49.9573062Z #define __FLT128_DIG__ 33 2025-05-07T20:25:49.9573382Z #define __INT32_C(c) c 2025-05-07T20:25:49.9573710Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:49.9574098Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:49.9574503Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:49.9574898Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:49.9575362Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:49.9575786Z #define unix 1 2025-05-07T20:25:49.9576079Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:49.9576517Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9576935Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:49.9577354Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:49.9577812Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:49.9578143Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:49.9578494Z #define __ELF__ 1 2025-05-07T20:25:49.9578805Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:49.9579210Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:49.9579599Z #define __FLT_RADIX__ 2 2025-05-07T20:25:49.9579918Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:49.9580415Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:49.9580912Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:49.9581257Z #define __SSE_MATH__ 1 2025-05-07T20:25:49.9581552Z #define __k8 1 2025-05-07T20:25:49.9581951Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:49.9582459Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:49.9582870Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:49.9583419Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:49.9583791Z #define __LDBL_DIG__ 18 2025-05-07T20:25:49.9584113Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:49.9584464Z #define __x86_64__ 1 2025-05-07T20:25:49.9584785Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:49.9585184Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:49.9585636Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9586045Z #define __FLT64_DIG__ 15 2025-05-07T20:25:49.9586418Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9586889Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:49.9587317Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9587676Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:49.9588063Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9588487Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:49.9589013Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:49.9589597Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:49.9590006Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:49.9590498Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:49.9590975Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:49.9591407Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:49.9591803Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:49.9592241Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:49.9592633Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:49.9592966Z #define __SEG_FS 1 2025-05-07T20:25:49.9593286Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:49.9593669Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:49.9594044Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9594426Z #define __SEG_GS 1 2025-05-07T20:25:49.9594857Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:49.9595393Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:49.9595778Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:49.9596170Z #define __INT16_TYPE__ short int 2025-05-07T20:25:49.9596555Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:49.9596961Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:49.9597316Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:49.9597662Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:49.9598209Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:49.9598676Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:49.9599285Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9599682Z #define linux 1 2025-05-07T20:25:49.9599977Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9600365Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.9600732Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:49.9601059Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:49.9601400Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:49.9601749Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:49.9602234Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:49.9602795Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:49.9603254Z #define __code_model_small__ 1 2025-05-07T20:25:49.9603613Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:49.9603984Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:49.9604471Z #define __k8__ 1 2025-05-07T20:25:49.9604790Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:49.9605152Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:49.9605551Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:49.9605909Z #define __pic__ 2 2025-05-07T20:25:49.9606243Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9606655Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:49.9607039Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9607470Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:49.9607960Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.9608970Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:49.9609628Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:49.9610021Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:49.9610429Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:49.9610751Z #define __linux__ 1 2025-05-07T20:25:49.9611028Z #define __INT64_TYPE__ long int 2025-05-07T20:25:49.9611379Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:49.9611719Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:49.9612065Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:49.9612399Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:49.9612785Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9613224Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:49.9613620Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:49.9613969Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:49.9614360Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:49.9614746Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:49.9615180Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:49.9615656Z #define __SSE__ 1 2025-05-07T20:25:49.9615941Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:49.9616403Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:49.9616886Z #define __amd64__ 1 2025-05-07T20:25:49.9617178Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:49.9617520Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:49.9617877Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:49.9618226Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:49.9618581Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:49.9618952Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:49.9619303Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:49.9619645Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:49.9619992Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:49.9620456Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:49.9621089Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:49.9621584Z #define _LP64 1 2025-05-07T20:25:49.9621870Z #define __UINT8_C(c) c 2025-05-07T20:25:49.9622195Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:49.9622556Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:49.9622914Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:49.9623461Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:49.9623864Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:49.9624374Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:49.9625024Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:49.9625524Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9625906Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:49.9626320Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:49.9626799Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:49.9627290Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:49.9627633Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:49.9628094Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:49.9628589Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:49.9628930Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:49.9629303Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:49.9629637Z #define __FXSR__ 1 2025-05-07T20:25:49.9630034Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.9630657Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:49.9631204Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:49.9631622Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:49.9631955Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:49.9632394Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:49.9632890Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:49.9633220Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:49.9633518Z #define __PIC__ 2 2025-05-07T20:25:49.9633957Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:49.9634484Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:49.9634988Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:49.9635423Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:49.9635855Z #define __SSE2__ 1 2025-05-07T20:25:49.9636143Z #define __INT32_TYPE__ int 2025-05-07T20:25:49.9636450Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:49.9636790Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:49.9637230Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:49.9637689Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:49.9638058Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:49.9638410Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:49.9638753Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9639112Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:49.9639425Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:49.9639739Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:49.9640125Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9640518Z #define __PIE__ 2 2025-05-07T20:25:49.9640942Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:49.9641453Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:49.9641916Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:49.9642406Z #define __INT16_C(c) c 2025-05-07T20:25:49.9642689Z #define __STDC__ 1 2025-05-07T20:25:49.9642993Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:49.9643346Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:49.9643669Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:49.9644067Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:49.9644712Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:49.9645139Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:49.9645487Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:49.9645853Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:49.9646197Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:49.9646559Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:49.9646962Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:49.9647334Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:49.9647856Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:49.9648402Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:49.9648920Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:49.9649314Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:49.9649719Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:49.9650037Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:49.9650239Z 2025-05-07T20:25:50.0240504Z 2025-05-07T20:25:50.0241463Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:50.0241929Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:50.0242154Z 2025-05-07T20:25:51.9641112Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9641525Z #define __cpp_attributes 200809L 2025-05-07T20:25:51.9641865Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:51.9642203Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:51.9642611Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:51.9643048Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:51.9643600Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:51.9644177Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:51.9644748Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:51.9645043Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:51.9645339Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:51.9645591Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:51.9645825Z #define __CHAR_BIT__ 8 2025-05-07T20:25:51.9646050Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:51.9646284Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:51.9646521Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:51.9646770Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:51.9647368Z #define __cpp_static_assert 201411L 2025-05-07T20:25:51.9647646Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:51.9647923Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9648210Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:51.9648503Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:51.9648823Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:51.9649147Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:51.9649552Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:51.9649958Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:51.9650273Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:51.9650552Z #define __GCC_IEC_559 2 2025-05-07T20:25:51.9650795Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:51.9651061Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:51.9651332Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:51.9651622Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:51.9651914Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:51.9652230Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:51.9652538Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:51.9652861Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9653189Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:51.9653465Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.9653733Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:51.9654011Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:51.9654314Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:51.9654573Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:51.9654838Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:51.9655112Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:51.9655451Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:51.9655773Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:51.9656028Z #define __INT8_C(c) c 2025-05-07T20:25:51.9656275Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:51.9656545Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:51.9656873Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9657185Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:51.9657443Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:51.9657890Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:51.9658194Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.9658525Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:51.9658797Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:51.9659064Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.9659318Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9659575Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:51.9659841Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:51.9660223Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:51.9660614Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:51.9660890Z #define __linux 1 2025-05-07T20:25:51.9661107Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:51.9661394Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:51.9661652Z #define __unix 1 2025-05-07T20:25:51.9661870Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:51.9662157Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:51.9662425Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:51.9662682Z #define __WINT_MIN__ 0U 2025-05-07T20:25:51.9662915Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.9663180Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:51.9663453Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:51.9663708Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:51.9663940Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:51.9664210Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:51.9664497Z #define __INT64_C(c) c ## L 2025-05-07T20:25:51.9664741Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:51.9665025Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:51.9665375Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:51.9665667Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:51.9665926Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:51.9666189Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:51.9666544Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:51.9666908Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:51.9676046Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:51.9676349Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:51.9676634Z #define __DBL_DIG__ 15 2025-05-07T20:25:51.9676859Z #define __FLT32_DIG__ 6 2025-05-07T20:25:51.9677170Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:51.9677521Z #define __GXX_WEAK__ 1 2025-05-07T20:25:51.9677749Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:51.9678006Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:51.9678339Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:51.9678696Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:51.9678952Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:51.9679252Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:51.9679583Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:51.9679987Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:51.9680392Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:51.9680677Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:51.9680930Z #define __unix__ 1 2025-05-07T20:25:51.9681167Z #define __INT_WIDTH__ 32 2025-05-07T20:25:51.9681418Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:51.9681662Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:51.9681911Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:51.9682185Z #define __UINT16_C(c) c 2025-05-07T20:25:51.9682432Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:51.9682678Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:51.9683043Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:51.9683410Z #define __gnu_linux__ 1 2025-05-07T20:25:51.9683655Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:51.9683909Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9684192Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.9684739Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9685000Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:51.9685262Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:51.9685515Z #define __GNUC__ 11 2025-05-07T20:25:51.9685733Z #define __GXX_RTTI 1 2025-05-07T20:25:51.9685963Z #define __pie__ 2 2025-05-07T20:25:51.9686183Z #define __MMX__ 1 2025-05-07T20:25:51.9686402Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:51.9686670Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:51.9686950Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:51.9687208Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:51.9687452Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:51.9687746Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:51.9688067Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:51.9688400Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.9688770Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:51.9689068Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9689377Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.9689637Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:51.9689897Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:51.9690188Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:51.9690474Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:51.9690734Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:51.9690979Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:51.9691257Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:51.9691542Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:51.9691796Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:51.9692069Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:51.9692317Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:51.9692676Z #define __cplusplus 201703L 2025-05-07T20:25:51.9692932Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:51.9693283Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:51.9693533Z #define __DEPRECATED 1 2025-05-07T20:25:51.9693777Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:51.9694072Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:51.9694329Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:51.9694629Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:51.9694980Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:51.9695243Z #define __SSE2_MATH__ 1 2025-05-07T20:25:51.9695476Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:51.9695770Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9696059Z #define __amd64 1 2025-05-07T20:25:51.9696278Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:51.9696530Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:51.9696790Z #define __GNUG__ 11 2025-05-07T20:25:51.9697045Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:51.9697343Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:51.9697592Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:51.9697847Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:51.9698106Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:51.9698367Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:51.9698639Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:51.9698920Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:51.9699180Z #define __cpp_hex_float 201603L 2025-05-07T20:25:51.9699444Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:51.9699695Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:51.9699966Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:51.9700230Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:51.9700494Z #define __x86_64 1 2025-05-07T20:25:51.9700706Z #define __cpp_lambdas 200907L 2025-05-07T20:25:51.9700971Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:51.9701342Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:51.9701717Z #define __cpp_template_auto 201606L 2025-05-07T20:25:51.9702070Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:51.9702512Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:51.9703045Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:51.9703423Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:51.9703670Z #define __LP64__ 1 2025-05-07T20:25:51.9703886Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9704231Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:51.9704598Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:51.9704866Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.9705132Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:51.9705404Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:51.9705667Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:51.9705918Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:51.9706176Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:51.9706496Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:51.9706838Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:51.9707110Z #define __FLT_DIG__ 6 2025-05-07T20:25:51.9707342Z #define __NO_INLINE__ 1 2025-05-07T20:25:51.9707574Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:51.9707889Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:51.9708226Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:51.9708767Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:51.9709093Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:51.9709414Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:51.9709765Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:51.9710135Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:51.9710465Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:51.9710791Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:51.9711262Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:51.9711526Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:51.9711825Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.9712151Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:51.9712432Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:51.9712694Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:51.9712943Z #define __FLT128_DIG__ 33 2025-05-07T20:25:51.9713183Z #define __INT32_C(c) c 2025-05-07T20:25:51.9713421Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:51.9713701Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:51.9713965Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:51.9714237Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:51.9714548Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:51.9714845Z #define unix 1 2025-05-07T20:25:51.9715059Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:51.9715318Z #define __cpp_rtti 199711L 2025-05-07T20:25:51.9715570Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:51.9715897Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9716196Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:51.9716491Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:51.9716812Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:51.9717065Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:51.9717335Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:51.9717614Z #define __ELF__ 1 2025-05-07T20:25:51.9717841Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:51.9718119Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:51.9718378Z #define __FLT_RADIX__ 2 2025-05-07T20:25:51.9718624Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:51.9718972Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:51.9719319Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:51.9719582Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:51.9719847Z #define __k8 1 2025-05-07T20:25:51.9720140Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:51.9720504Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:51.9720792Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:51.9721074Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:51.9721328Z #define __LDBL_DIG__ 18 2025-05-07T20:25:51.9721725Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:51.9721970Z #define __x86_64__ 1 2025-05-07T20:25:51.9722210Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:51.9722511Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:51.9722859Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9723160Z #define __FLT64_DIG__ 15 2025-05-07T20:25:51.9723451Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.9723802Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:51.9724115Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.9724468Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:51.9724745Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9725035Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:51.9725388Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:51.9725773Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:51.9726073Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:51.9726402Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:51.9726702Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:51.9727002Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:51.9727283Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:51.9727549Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:51.9727837Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:51.9728100Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:51.9728327Z #define __SEG_FS 1 2025-05-07T20:25:51.9728540Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:51.9728802Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:51.9729062Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.9729423Z #define __SEG_GS 1 2025-05-07T20:25:51.9729722Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:51.9730082Z [... remainder of the `c++ -dM -E` predefined-macro dump elided: GCC 11.4.0 targeting x86_64 Linux (__linux__ 1, __amd64__ 1, _LP64 1, __pic__ 2), C++17 feature-test macros (__cpp_constexpr 201603L, __cpp_deduction_guides 201703L, ...), and integer/floating-point type layouts and limits ...]
2025-05-07T20:25:52.0358482Z + conda run -n build_binary c++ --version
2025-05-07T20:25:53.9767259Z c++
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:53.9767736Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:53.9768245Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:53.9768771Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:53.9769094Z 2025-05-07T20:25:53.9769099Z 2025-05-07T20:25:54.0497131Z 2025-05-07T20:25:54.0498294Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:54.0498894Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:54.0499199Z 2025-05-07T20:25:56.0659566Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:56.0662059Z 2025-05-07T20:25:56.0662455Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:56.0663272Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:56.0663610Z 2025-05-07T20:25:58.0939314Z #define __cplusplus 201703L 2025-05-07T20:25:58.0941641Z 2025-05-07T20:25:58.0942422Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:58.0985215Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.0985625Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.0999396Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:58.0999732Z env: 2025-05-07T20:25:58.0999944Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:58.1000222Z BUILD_ENV: build_binary 2025-05-07T20:25:58.1000455Z BUILD_TARGET: genai 2025-05-07T20:25:58.1000670Z BUILD_VARIANT: cuda 2025-05-07T20:25:58.1000880Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:58.1001134Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:58.1001422Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:58.1001731Z ##[endgroup] 2025-05-07T20:25:58.4394558Z ################################################################################ 2025-05-07T20:25:58.4394921Z # Install CUDA 2025-05-07T20:25:58.4395127Z # 2025-05-07T20:25:58.4411501Z # [2025-05-07T20:25:58.440Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:58.4411988Z ################################################################################ 2025-05-07T20:25:58.4412205Z 2025-05-07T20:25:58.4428295Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:58.5279529Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:58.5279873Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:58.5288210Z + conda clean --packages --tarball -y 2025-05-07T20:25:58.5288428Z 2025-05-07T20:25:59.2364378Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:59.2364720Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:59.3104846Z 2025-05-07T20:25:59.3115605Z + conda clean --all -y 2025-05-07T20:25:59.3115770Z 2025-05-07T20:25:59.9767995Z There are no unused tarball(s) to remove. 2025-05-07T20:25:59.9768310Z Will remove 1 index cache(s). 2025-05-07T20:25:59.9768781Z There are no unused package(s) to remove. 2025-05-07T20:25:59.9769113Z There are no tempfile(s) to remove. 2025-05-07T20:25:59.9769406Z There are no logfile(s) to remove. 2025-05-07T20:26:00.0478216Z 2025-05-07T20:26:00.0492625Z [INSTALL] Installing CUDA 12.6.3 ... 
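The retried step below reduces to a single conda command. A minimal sketch for reproducing it outside CI (flags copied verbatim from the log; the env name build_binary comes from BUILD_ENV above):

    # Pin the complete CUDA 12.6.3 toolchain from conda-forge only.
    # --override-channels excludes all other channels from the solve;
    # --force-reinstall re-lays the packages even if the spec is already met.
    conda install --force-reinstall -n build_binary \
        -c conda-forge --override-channels -y cuda=12.6.3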
2025-05-07T20:26:00.0518635Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:26:00.9747417Z Channels: 2025-05-07T20:26:00.9747722Z - conda-forge 2025-05-07T20:26:00.9747938Z Platform: linux-64 2025-05-07T20:26:11.9118223Z Collecting package metadata (repodata.json): done 2025-05-07T20:26:13.0383437Z Solving environment: done 2025-05-07T20:26:13.1133397Z 2025-05-07T20:26:13.1133732Z ## Package Plan ## 2025-05-07T20:26:13.1133896Z 2025-05-07T20:26:13.1134160Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:13.1134495Z 2025-05-07T20:26:13.1134587Z added / updated specs: 2025-05-07T20:26:13.1134840Z - cuda=12.6.3 2025-05-07T20:26:13.1135005Z 2025-05-07T20:26:13.1135010Z 2025-05-07T20:26:13.1135128Z The following packages will be downloaded: 2025-05-07T20:26:13.1135340Z 2025-05-07T20:26:13.1135456Z package | build 2025-05-07T20:26:13.1135764Z ---------------------------|----------------- 2025-05-07T20:26:13.1136128Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:13.1136632Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:13.1137213Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:13.1137800Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:13.1138195Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:13.1138609Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:26:13.1139463Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1139955Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:13.1140414Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:26:13.1140874Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:26:13.1141307Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1141747Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1142223Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:26:13.1142711Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1143210Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:26:13.1143712Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:26:13.1144184Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:26:13.1144617Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:26:13.1145057Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:26:13.1145500Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:26:13.1145939Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:13.1146413Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:26:13.1146875Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:26:13.1147303Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1147757Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:13.1148213Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:13.1148633Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:13.1149076Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:26:13.1149526Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:26:13.1149972Z cuda-nvcc-tools-12.6.85 | 
he02047a_0 23.0 MB conda-forge 2025-05-07T20:26:13.1150425Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:13.1150864Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:26:13.1151477Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:26:13.1151906Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:26:13.1152340Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:26:13.1152768Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:26:13.1153194Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:26:13.1153624Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:13.1154058Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:26:13.1154514Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:26:13.1154956Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:26:13.1155382Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:26:13.1155816Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:13.1156256Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:26:13.1156836Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:13.1157291Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:13.1157745Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:26:13.1158195Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:13.1158608Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:13.1159027Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:26:13.1159472Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:13.1159920Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:13.1160320Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:13.1160695Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:13.1161153Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:13.1161651Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:13.1162152Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:13.1162634Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:13.1163060Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:13.1163510Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:13.1163970Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:13.1164535Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:13.1164914Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:13.1165303Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:26:13.1165731Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:13.1166095Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:13.1166475Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:13.1166854Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:13.1167231Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:13.1167636Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:26:13.1168154Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:26:13.1168574Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:26:13.1168995Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:13.1169428Z libcufile-1.11.1.6 | h12f29b5_4 900 
KB conda-forge 2025-05-07T20:26:13.1169860Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:26:13.1170286Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:26:13.1170720Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:26:13.1171157Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:26:13.1171598Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:26:13.1172054Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:26:13.1172504Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:13.1172965Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:13.1173392Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:13.1173903Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:13.1174337Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:13.1174769Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:13.1175192Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:13.1175611Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:13.1176031Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:13.1176427Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:13.1176825Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:26:13.1177250Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:26:13.1177661Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:13.1178084Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:26:13.1178536Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:13.1178990Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:26:13.1179436Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:13.1179883Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:26:13.1180321Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:13.1180747Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:13.1181143Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:13.1181567Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:13.1181997Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:13.1182396Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:13.1182792Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:13.1183204Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:13.1183637Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:13.1184038Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:13.1184568Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:13.1184977Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:13.1185372Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:13.1185817Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:26:13.1186246Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:13.1186615Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:13.1186988Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:13.1187421Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:13.1187846Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:13.1188252Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:13.1188692Z python-3.13.0 
|h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:13.1189111Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:13.1189510Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:13.1189984Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:13.1190377Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:13.1190772Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:13.1191195Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:13.1191638Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:13.1192088Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:13.1192566Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:13.1193011Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:13.1193454Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:13.1193906Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:13.1194330Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:13.1194746Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:13.1195179Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:13.1195639Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:13.1196103Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:13.1196565Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:13.1197013Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:13.1197463Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:13.1197887Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:13.1198323Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:13.1198776Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:13.1199213Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:13.1199616Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:13.1199991Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:13.1200358Z ------------------------------------------------------------ 2025-05-07T20:26:13.1200770Z Total: 1.64 GB 2025-05-07T20:26:13.1200988Z 2025-05-07T20:26:13.1201110Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:13.1201320Z 2025-05-07T20:26:13.1201534Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:13.1201946Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:13.1202355Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:13.1202809Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:13.1203230Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:26:13.1203685Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:26:13.1204259Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1204887Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:26:13.1205421Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:13.1205961Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:26:13.1206456Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:26:13.1207054Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1207614Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1208198Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:26:13.1209293Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1209889Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1210439Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1210941Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:26:13.1211445Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:26:13.1211974Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1212498Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1213064Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:13.1213585Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:26:13.1214065Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:26:13.1214671Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:26:13.1215200Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:26:13.1215671Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:26:13.1216189Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:26:13.1216736Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:26:13.1217369Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:26:13.1217919Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:26:13.1218475Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1219039Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1219527Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:26:13.1220016Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1220509Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:26:13.1220998Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:26:13.1221649Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1222152Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:13.1222696Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:26:13.1223229Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:26:13.1223744Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:26:13.1224232Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1224737Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1225282Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:26:13.1225807Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:26:13.1226343Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:26:13.1226879Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:26:13.1227339Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1227797Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:26:13.1228456Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:26:13.1228990Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:13.1229423Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:13.1230057Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:13.1230862Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:13.1231534Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:13.1232091Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:13.1232581Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:13.1233062Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:13.1233539Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:13.1233987Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:13.1234407Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:13.1234941Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:26:13.1235356Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:13.1235728Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:13.1236133Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:13.1236549Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:13.1236937Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:13.1237370Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:26:13.1237866Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:26:13.1238355Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:26:13.1238818Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:26:13.1239301Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:26:13.1239790Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:26:13.1240281Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:26:13.1240816Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:26:13.1241431Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:26:13.1241952Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:26:13.1242475Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:26:13.1242988Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:26:13.1243670Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:13.1244155Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:13.1244719Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:13.1245219Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:13.1245731Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:13.1246213Z libglib 
conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:13.1246674Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:13.1247141Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:13.1247568Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:13.1247993Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:26:13.1248581Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:26:13.1249052Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:13.1249516Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:26:13.1250041Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:26:13.1250570Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:26:13.1251110Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:26:13.1251642Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:26:13.1261699Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:26:13.1262198Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:13.1262643Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:13.1263101Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:13.1263551Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:13.1263969Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:13.1264417Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:13.1264888Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:13.1265324Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:13.1265743Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:13.1266143Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:13.1266618Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:26:13.1267096Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:13.1267496Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:13.1268010Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:13.1268495Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:13.1268990Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:13.1269457Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:13.1269939Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:13.1270504Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:13.1270939Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:13.1271434Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:13.1271958Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:13.1272505Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:13.1273077Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:26:13.1273613Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:26:13.1274116Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:26:13.1274636Z xorg-libice 
conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:26:13.1275107Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:26:13.1275577Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:26:13.1276060Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:26:13.1276600Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:26:13.1277175Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:26:13.1277786Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:26:13.1278293Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:26:13.1278843Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:26:13.1279329Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:26:13.1279803Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:26:13.1280380Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:26:13.1280903Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:26:13.1281432Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:26:13.1281671Z 2025-05-07T20:26:13.1281783Z The following packages will be UPDATED: 2025-05-07T20:26:13.1281992Z 2025-05-07T20:26:13.1282264Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:26:13.1282880Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:26:13.1283483Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:26:13.1284049Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:13.1284511Z 2025-05-07T20:26:13.1284724Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:26:13.1285036Z 2025-05-07T20:26:13.1285272Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:26:13.1285911Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:26:13.1286601Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:26:13.1286934Z 2025-05-07T20:26:13.1286954Z 2025-05-07T20:26:13.1286958Z 2025-05-07T20:26:13.1287101Z Downloading and Extracting Packages: ...working... 2025-05-07T20:26:13.1287484Z nsight-compute-2024. 
[... interleaved download/extract progress bars elided: nsight-compute (443.1 MB), libcublas (256.2 MB), libcufft (156.2 MB), libcusparse (118.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (109.3 MB), libcusolver (95.8 MB), libnpp (93.4 MB), cuda-nvdisasm (47.6 MB), libcurand (39.9 MB), gds-tools (37.8 MB), python (31.5 MB), cuda-nvcc-tools (23.0 MB), cuda-nvrtc (17.3 MB), libnvjitlink (14.9 MB), cuda-nvcc-dev_linux-64 (10.8 MB), cuda-nvvm-tools (10.4 MB), cuda-sanitizer-api (8.9 MB), cuda-nvvm-impl (7.7 MB), and the remaining smaller packages ...]
2025-05-07T20:26:16.4306306Z nsight-compute-2024.
| 443.1 MB | ##3 | 24% 2025-05-07T20:26:16.4308663Z 2025-05-07T20:26:16.5217400Z libcublas-12.6.4.1 | 256.2 MB | ####5 | 46%  2025-05-07T20:26:16.5217666Z 2025-05-07T20:26:16.5217670Z 2025-05-07T20:26:16.5305063Z libcufft-11.3.0.4 | 156.2 MB | #######6 | 76%  2025-05-07T20:26:16.5311952Z nsight-compute-2024. | 443.1 MB | ##4 | 25% 2025-05-07T20:26:16.5312253Z 2025-05-07T20:26:16.6217637Z libcublas-12.6.4.1 | 256.2 MB | ####7 | 48%  2025-05-07T20:26:16.6217946Z 2025-05-07T20:26:16.6217954Z 2025-05-07T20:26:16.6333280Z libcufft-11.3.0.4 | 156.2 MB | #######8 | 79%  2025-05-07T20:26:16.6369822Z nsight-compute-2024. | 443.1 MB | ##5 | 26% 2025-05-07T20:26:16.6373362Z 2025-05-07T20:26:16.7223797Z libcublas-12.6.4.1 | 256.2 MB | ####9 | 49%  2025-05-07T20:26:16.7224384Z 2025-05-07T20:26:16.7226256Z 2025-05-07T20:26:16.7336085Z libcufft-11.3.0.4 | 156.2 MB | ########1 | 82%  2025-05-07T20:26:16.7532048Z nsight-compute-2024. | 443.1 MB | ##6 | 27% 2025-05-07T20:26:16.7533165Z 2025-05-07T20:26:16.8225798Z libcublas-12.6.4.1 | 256.2 MB | ##### | 51%  2025-05-07T20:26:16.8226197Z 2025-05-07T20:26:16.8226201Z 2025-05-07T20:26:16.8337974Z libcufft-11.3.0.4 | 156.2 MB | ########4 | 85%  2025-05-07T20:26:16.8681669Z nsight-compute-2024. | 443.1 MB | ##7 | 28% 2025-05-07T20:26:16.8682298Z 2025-05-07T20:26:16.9227405Z libcublas-12.6.4.1 | 256.2 MB | #####2 | 53%  2025-05-07T20:26:16.9227671Z 2025-05-07T20:26:16.9227675Z 2025-05-07T20:26:16.9340663Z libcufft-11.3.0.4 | 156.2 MB | ########7 | 88%  2025-05-07T20:26:16.9682225Z nsight-compute-2024. | 443.1 MB | ##8 | 29% 2025-05-07T20:26:16.9686447Z 2025-05-07T20:26:17.0242842Z libcublas-12.6.4.1 | 256.2 MB | #####4 | 54%  2025-05-07T20:26:17.0243131Z 2025-05-07T20:26:17.0243146Z 2025-05-07T20:26:17.0388933Z libcufft-11.3.0.4 | 156.2 MB | ######### | 90%  2025-05-07T20:26:17.0710016Z nsight-compute-2024. | 443.1 MB | ##9 | 30% 2025-05-07T20:26:17.0711292Z 2025-05-07T20:26:17.1247021Z libcublas-12.6.4.1 | 256.2 MB | #####5 | 56%  2025-05-07T20:26:17.1247453Z 2025-05-07T20:26:17.1249566Z 2025-05-07T20:26:17.1394700Z libcufft-11.3.0.4 | 156.2 MB | #########3 | 93%  2025-05-07T20:26:17.1854377Z nsight-compute-2024. | 443.1 MB | ### | 31% 2025-05-07T20:26:17.1854767Z 2025-05-07T20:26:17.2249160Z libcublas-12.6.4.1 | 256.2 MB | #####7 | 57%  2025-05-07T20:26:17.2249422Z 2025-05-07T20:26:17.2249426Z 2025-05-07T20:26:17.2396626Z libcufft-11.3.0.4 | 156.2 MB | #########6 | 96%  2025-05-07T20:26:17.2982613Z nsight-compute-2024. | 443.1 MB | ###1 | 32% 2025-05-07T20:26:17.2983395Z 2025-05-07T20:26:17.3250020Z libcublas-12.6.4.1 | 256.2 MB | #####8 | 59%  2025-05-07T20:26:17.3250291Z 2025-05-07T20:26:17.3250295Z 2025-05-07T20:26:17.3397167Z libcufft-11.3.0.4 | 156.2 MB | #########9 | 99%  2025-05-07T20:26:17.4143974Z nsight-compute-2024. | 443.1 MB | ###2 | 33% 2025-05-07T20:26:17.4144356Z 2025-05-07T20:26:17.4397942Z libcublas-12.6.4.1 | 256.2 MB | ###### | 60%  2025-05-07T20:26:17.5146676Z nsight-compute-2024. | 443.1 MB | ###4 | 34% 2025-05-07T20:26:17.5147542Z 2025-05-07T20:26:17.5398939Z libcublas-12.6.4.1 | 256.2 MB | ######1 | 62%  2025-05-07T20:26:17.6149656Z nsight-compute-2024. | 443.1 MB | ###5 | 36% 2025-05-07T20:26:17.6150527Z 2025-05-07T20:26:17.6427210Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 64%  2025-05-07T20:26:17.7152164Z nsight-compute-2024. | 443.1 MB | ###7 | 37% 2025-05-07T20:26:17.7152467Z 2025-05-07T20:26:17.7427719Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 66%  2025-05-07T20:26:17.8152682Z nsight-compute-2024. 
| 443.1 MB | ###8 | 39% 2025-05-07T20:26:17.8153008Z 2025-05-07T20:26:17.8682099Z libcublas-12.6.4.1 | 256.2 MB | ######7 | 68%  2025-05-07T20:26:17.9154584Z nsight-compute-2024. | 443.1 MB | ###9 | 40% 2025-05-07T20:26:17.9155487Z 2025-05-07T20:26:17.9683606Z libcublas-12.6.4.1 | 256.2 MB | ####### | 70%  2025-05-07T20:26:18.0159216Z nsight-compute-2024. | 443.1 MB | ####1 | 41% 2025-05-07T20:26:18.0160275Z 2025-05-07T20:26:18.0861676Z libcublas-12.6.4.1 | 256.2 MB | #######1 | 72%  2025-05-07T20:26:18.1185791Z nsight-compute-2024. | 443.1 MB | ####2 | 42% 2025-05-07T20:26:18.1187627Z 2025-05-07T20:26:18.1885457Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 74%  2025-05-07T20:26:18.2187426Z nsight-compute-2024. | 443.1 MB | ####3 | 43% 2025-05-07T20:26:18.2187777Z 2025-05-07T20:26:18.2546274Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 76%  2025-05-07T20:26:18.2546611Z 2025-05-07T20:26:18.2546620Z 2025-05-07T20:26:18.2547010Z 2025-05-07T20:26:18.2547016Z 2025-05-07T20:26:18.2885886Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:18.2925176Z nsight-compute-2024. | 443.1 MB | ####4 | 45% 2025-05-07T20:26:18.2925470Z 2025-05-07T20:26:18.2925475Z 2025-05-07T20:26:18.2925479Z 2025-05-07T20:26:18.2925484Z 2025-05-07T20:26:18.2927242Z 2025-05-07T20:26:18.3926707Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:26:18.3927293Z 2025-05-07T20:26:18.3927298Z 2025-05-07T20:26:18.3927303Z 2025-05-07T20:26:18.3927309Z 2025-05-07T20:26:18.3927346Z 2025-05-07T20:26:18.4054080Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 4%  2025-05-07T20:26:18.4057887Z 2025-05-07T20:26:18.4289976Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 78%  2025-05-07T20:26:18.4928740Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:26:18.4929057Z 2025-05-07T20:26:18.4929061Z 2025-05-07T20:26:18.4929064Z 2025-05-07T20:26:18.4929101Z 2025-05-07T20:26:18.4934296Z 2025-05-07T20:26:18.5366226Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:26:18.5366536Z 2025-05-07T20:26:18.5681467Z libcublas-12.6.4.1 | 256.2 MB | #######9 | 79%  2025-05-07T20:26:18.5929806Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:26:18.5930069Z 2025-05-07T20:26:18.5930073Z 2025-05-07T20:26:18.5930077Z 2025-05-07T20:26:18.5930082Z 2025-05-07T20:26:18.5930095Z 2025-05-07T20:26:18.6858778Z cuda-nvvp-12.6.80 | 109.3 MB | # | 10%  2025-05-07T20:26:18.6880702Z nsight-compute-2024. 
| 443.1 MB | ####7 | 48% 2025-05-07T20:26:18.6881069Z 2025-05-07T20:26:18.6881076Z 2025-05-07T20:26:18.6882549Z 2025-05-07T20:26:18.6933941Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:18.6934221Z 2025-05-07T20:26:18.6934225Z 2025-05-07T20:26:18.6934228Z 2025-05-07T20:26:18.6934233Z 2025-05-07T20:26:18.6934237Z 2025-05-07T20:26:18.7272190Z cuda-nvvp-12.6.80 | 109.3 MB | #3 | 14%  2025-05-07T20:26:18.7272770Z 2025-05-07T20:26:18.7272780Z 2025-05-07T20:26:18.7272789Z 2025-05-07T20:26:18.7272817Z 2025-05-07T20:26:18.7272826Z 2025-05-07T20:26:18.7272835Z 2025-05-07T20:26:18.7301299Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:26:18.7304365Z 2025-05-07T20:26:18.8108544Z libcublas-12.6.4.1 | 256.2 MB | ########1 | 81%  2025-05-07T20:26:18.8108990Z 2025-05-07T20:26:18.8108995Z 2025-05-07T20:26:18.8109000Z 2025-05-07T20:26:18.8109005Z 2025-05-07T20:26:18.8111736Z 2025-05-07T20:26:18.8275961Z cuda-nvvp-12.6.80 | 109.3 MB | #6 | 17%  2025-05-07T20:26:18.8276250Z 2025-05-07T20:26:18.8276255Z 2025-05-07T20:26:18.8276260Z 2025-05-07T20:26:18.8276264Z 2025-05-07T20:26:18.8276269Z 2025-05-07T20:26:18.8279468Z 2025-05-07T20:26:18.8411739Z libcusolver-11.7.1.2 | 95.8 MB | 2 | 3%  2025-05-07T20:26:18.8479911Z nsight-compute-2024. | 443.1 MB | ####8 | 49% 2025-05-07T20:26:18.8481937Z 2025-05-07T20:26:18.9196548Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 82%  2025-05-07T20:26:18.9196959Z 2025-05-07T20:26:18.9196968Z 2025-05-07T20:26:18.9196976Z 2025-05-07T20:26:18.9196985Z 2025-05-07T20:26:18.9196993Z 2025-05-07T20:26:18.9277965Z cuda-nvvp-12.6.80 | 109.3 MB | #9 | 20%  2025-05-07T20:26:18.9278277Z 2025-05-07T20:26:18.9278282Z 2025-05-07T20:26:18.9278286Z 2025-05-07T20:26:18.9278290Z 2025-05-07T20:26:18.9278294Z 2025-05-07T20:26:18.9279975Z 2025-05-07T20:26:18.9670844Z libcusolver-11.7.1.2 | 95.8 MB | 5 | 6%  2025-05-07T20:26:18.9672873Z 2025-05-07T20:26:18.9676815Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 84%  2025-05-07T20:26:19.0222683Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:26:19.0222946Z 2025-05-07T20:26:19.0222950Z 2025-05-07T20:26:19.0222954Z 2025-05-07T20:26:19.0223228Z 2025-05-07T20:26:19.0228134Z 2025-05-07T20:26:19.0277927Z cuda-nvvp-12.6.80 | 109.3 MB | ##2 | 23%  2025-05-07T20:26:19.0278208Z 2025-05-07T20:26:19.0278212Z 2025-05-07T20:26:19.0278216Z 2025-05-07T20:26:19.0278219Z 2025-05-07T20:26:19.0278223Z 2025-05-07T20:26:19.0279598Z 2025-05-07T20:26:19.0713579Z libcusolver-11.7.1.2 | 95.8 MB | 8 | 8%  2025-05-07T20:26:19.0714924Z 2025-05-07T20:26:19.0877363Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:26:19.1281064Z nsight-compute-2024. | 443.1 MB | ##### | 50% 2025-05-07T20:26:19.1281550Z 2025-05-07T20:26:19.1281559Z 2025-05-07T20:26:19.1281566Z 2025-05-07T20:26:19.1281573Z 2025-05-07T20:26:19.1281579Z 2025-05-07T20:26:19.1284164Z 2025-05-07T20:26:19.1354022Z libcusolver-11.7.1.2 | 95.8 MB | #1 | 11%  2025-05-07T20:26:19.1354473Z 2025-05-07T20:26:19.1354479Z 2025-05-07T20:26:19.1354484Z 2025-05-07T20:26:19.1354489Z 2025-05-07T20:26:19.1354512Z 2025-05-07T20:26:19.1859460Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:26:19.1862833Z 2025-05-07T20:26:19.2186713Z libcublas-12.6.4.1 | 256.2 MB | ########5 | 86%  2025-05-07T20:26:19.2282396Z nsight-compute-2024. 
| 443.1 MB | #####1 | 51% 2025-05-07T20:26:19.2282948Z 2025-05-07T20:26:19.2282959Z 2025-05-07T20:26:19.2282967Z 2025-05-07T20:26:19.2282974Z 2025-05-07T20:26:19.2282983Z 2025-05-07T20:26:19.2284796Z 2025-05-07T20:26:19.2441293Z libcusolver-11.7.1.2 | 95.8 MB | #4 | 14%  2025-05-07T20:26:19.2441814Z 2025-05-07T20:26:19.2441819Z 2025-05-07T20:26:19.2441822Z 2025-05-07T20:26:19.2441826Z 2025-05-07T20:26:19.2444643Z 2025-05-07T20:26:19.3027175Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:26:19.3027606Z 2025-05-07T20:26:19.3283891Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 87%  2025-05-07T20:26:19.3284174Z 2025-05-07T20:26:19.3284210Z 2025-05-07T20:26:19.3284214Z 2025-05-07T20:26:19.3284218Z 2025-05-07T20:26:19.3284222Z 2025-05-07T20:26:19.3287919Z 2025-05-07T20:26:19.3365793Z libcusolver-11.7.1.2 | 95.8 MB | #6 | 17%  2025-05-07T20:26:19.3449732Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:26:19.3450010Z 2025-05-07T20:26:19.3450015Z 2025-05-07T20:26:19.3450019Z 2025-05-07T20:26:19.3450023Z 2025-05-07T20:26:19.3450027Z 2025-05-07T20:26:19.4028361Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 31%  2025-05-07T20:26:19.4030048Z 2025-05-07T20:26:19.4285185Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 88%  2025-05-07T20:26:19.4285739Z 2025-05-07T20:26:19.4285748Z 2025-05-07T20:26:19.4285754Z 2025-05-07T20:26:19.4285760Z 2025-05-07T20:26:19.4285765Z 2025-05-07T20:26:19.4286398Z 2025-05-07T20:26:19.4455473Z libcusolver-11.7.1.2 | 95.8 MB | ## | 20%  2025-05-07T20:26:19.4573605Z nsight-compute-2024. | 443.1 MB | #####2 | 52% 2025-05-07T20:26:19.4573868Z 2025-05-07T20:26:19.4573872Z 2025-05-07T20:26:19.4573876Z 2025-05-07T20:26:19.4573880Z 2025-05-07T20:26:19.4575330Z 2025-05-07T20:26:19.5034857Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 34%  2025-05-07T20:26:19.5035188Z 2025-05-07T20:26:19.5287368Z libcublas-12.6.4.1 | 256.2 MB | ########9 | 89%  2025-05-07T20:26:19.5287636Z 2025-05-07T20:26:19.5287640Z 2025-05-07T20:26:19.5287644Z 2025-05-07T20:26:19.5287647Z 2025-05-07T20:26:19.5287651Z 2025-05-07T20:26:19.5287655Z 2025-05-07T20:26:19.5458331Z libcusolver-11.7.1.2 | 95.8 MB | ##3 | 23%  2025-05-07T20:26:19.5592528Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:26:19.5592795Z 2025-05-07T20:26:19.5592799Z 2025-05-07T20:26:19.5592803Z 2025-05-07T20:26:19.5592807Z 2025-05-07T20:26:19.5594810Z 2025-05-07T20:26:19.6055696Z cuda-nvvp-12.6.80 | 109.3 MB | ###6 | 37%  2025-05-07T20:26:19.6055985Z 2025-05-07T20:26:19.6290684Z libcublas-12.6.4.1 | 256.2 MB | ######### | 91%  2025-05-07T20:26:19.6291226Z 2025-05-07T20:26:19.6291235Z 2025-05-07T20:26:19.6291243Z 2025-05-07T20:26:19.6291251Z 2025-05-07T20:26:19.6291260Z 2025-05-07T20:26:19.6292083Z 2025-05-07T20:26:19.6459150Z libcusolver-11.7.1.2 | 95.8 MB | ##6 | 27%  2025-05-07T20:26:19.6595853Z nsight-compute-2024. | 443.1 MB | #####3 | 54% 2025-05-07T20:26:19.6596184Z 2025-05-07T20:26:19.6596191Z 2025-05-07T20:26:19.6596197Z 2025-05-07T20:26:19.6596213Z 2025-05-07T20:26:19.6599369Z 2025-05-07T20:26:19.7058355Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 39%  2025-05-07T20:26:19.7058959Z 2025-05-07T20:26:19.7297254Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 92%  2025-05-07T20:26:19.7297659Z 2025-05-07T20:26:19.7297665Z 2025-05-07T20:26:19.7297671Z 2025-05-07T20:26:19.7297676Z 2025-05-07T20:26:19.7297682Z 2025-05-07T20:26:19.7298903Z 2025-05-07T20:26:19.7460655Z libcusolver-11.7.1.2 | 95.8 MB | ### | 30%  2025-05-07T20:26:19.7617268Z nsight-compute-2024. 
| 443.1 MB | #####4 | 55% 2025-05-07T20:26:19.7617572Z 2025-05-07T20:26:19.7617799Z 2025-05-07T20:26:19.7617807Z 2025-05-07T20:26:19.7617818Z 2025-05-07T20:26:19.7619029Z 2025-05-07T20:26:19.8061878Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 42%  2025-05-07T20:26:19.8062192Z 2025-05-07T20:26:19.8387164Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 93%  2025-05-07T20:26:19.8387427Z 2025-05-07T20:26:19.8387431Z 2025-05-07T20:26:19.8387435Z 2025-05-07T20:26:19.8387464Z 2025-05-07T20:26:19.8387469Z 2025-05-07T20:26:19.8388132Z 2025-05-07T20:26:19.8464592Z libcusolver-11.7.1.2 | 95.8 MB | ###3 | 34%  2025-05-07T20:26:19.8619008Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:26:19.8619270Z 2025-05-07T20:26:19.8619274Z 2025-05-07T20:26:19.8619278Z 2025-05-07T20:26:19.8619282Z 2025-05-07T20:26:19.8619301Z 2025-05-07T20:26:19.9095689Z cuda-nvvp-12.6.80 | 109.3 MB | ####4 | 45%  2025-05-07T20:26:19.9095970Z 2025-05-07T20:26:19.9387702Z libcublas-12.6.4.1 | 256.2 MB | #########3 | 94%  2025-05-07T20:26:19.9387963Z 2025-05-07T20:26:19.9387967Z 2025-05-07T20:26:19.9387971Z 2025-05-07T20:26:19.9387975Z 2025-05-07T20:26:19.9387978Z 2025-05-07T20:26:19.9391431Z 2025-05-07T20:26:19.9481733Z libcusolver-11.7.1.2 | 95.8 MB | ###6 | 37%  2025-05-07T20:26:19.9622384Z nsight-compute-2024. | 443.1 MB | #####5 | 56% 2025-05-07T20:26:19.9622934Z 2025-05-07T20:26:19.9622938Z 2025-05-07T20:26:19.9622942Z 2025-05-07T20:26:19.9622945Z 2025-05-07T20:26:19.9623660Z 2025-05-07T20:26:20.0096773Z cuda-nvvp-12.6.80 | 109.3 MB | ####7 | 48%  2025-05-07T20:26:20.0101166Z 2025-05-07T20:26:20.0392535Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:26:20.0392806Z 2025-05-07T20:26:20.0392812Z 2025-05-07T20:26:20.0392948Z 2025-05-07T20:26:20.0392957Z 2025-05-07T20:26:20.0392962Z 2025-05-07T20:26:20.0393703Z 2025-05-07T20:26:20.0476825Z libcusolver-11.7.1.2 | 95.8 MB | #### | 40%  2025-05-07T20:26:20.0672513Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:26:20.0672763Z 2025-05-07T20:26:20.0672859Z 2025-05-07T20:26:20.0672863Z 2025-05-07T20:26:20.0672867Z 2025-05-07T20:26:20.0673594Z 2025-05-07T20:26:20.1097517Z cuda-nvvp-12.6.80 | 109.3 MB | ##### | 50%  2025-05-07T20:26:20.1097979Z 2025-05-07T20:26:20.1393784Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 96%  2025-05-07T20:26:20.1394076Z 2025-05-07T20:26:20.1394080Z 2025-05-07T20:26:20.1394083Z 2025-05-07T20:26:20.1394087Z 2025-05-07T20:26:20.1394090Z 2025-05-07T20:26:20.1394094Z 2025-05-07T20:26:20.1478870Z libcusolver-11.7.1.2 | 95.8 MB | ####3 | 43%  2025-05-07T20:26:20.1674085Z nsight-compute-2024. | 443.1 MB | #####7 | 57% 2025-05-07T20:26:20.1674576Z 2025-05-07T20:26:20.1674581Z 2025-05-07T20:26:20.1674584Z 2025-05-07T20:26:20.1674588Z 2025-05-07T20:26:20.1675433Z 2025-05-07T20:26:20.2105118Z cuda-nvvp-12.6.80 | 109.3 MB | #####3 | 53%  2025-05-07T20:26:20.2106826Z 2025-05-07T20:26:20.2394885Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:26:20.2395186Z 2025-05-07T20:26:20.2395191Z 2025-05-07T20:26:20.2395195Z 2025-05-07T20:26:20.2395199Z 2025-05-07T20:26:20.2395203Z 2025-05-07T20:26:20.2398223Z 2025-05-07T20:26:20.2488927Z libcusolver-11.7.1.2 | 95.8 MB | ####6 | 47%  2025-05-07T20:26:20.2675969Z nsight-compute-2024. 
| 443.1 MB | #####8 | 58% 2025-05-07T20:26:20.2676239Z 2025-05-07T20:26:20.2676243Z 2025-05-07T20:26:20.2676247Z 2025-05-07T20:26:20.2676250Z 2025-05-07T20:26:20.2677593Z 2025-05-07T20:26:20.3109235Z cuda-nvvp-12.6.80 | 109.3 MB | #####6 | 56%  2025-05-07T20:26:20.3110978Z 2025-05-07T20:26:20.3438399Z libcublas-12.6.4.1 | 256.2 MB | #########8 | 99%  2025-05-07T20:26:20.3438685Z 2025-05-07T20:26:20.3438689Z 2025-05-07T20:26:20.3438693Z 2025-05-07T20:26:20.3438698Z 2025-05-07T20:26:20.3438701Z 2025-05-07T20:26:20.3438984Z 2025-05-07T20:26:20.3577689Z libcusolver-11.7.1.2 | 95.8 MB | ##### | 50%  2025-05-07T20:26:20.3783874Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:26:20.3784187Z 2025-05-07T20:26:20.3784194Z 2025-05-07T20:26:20.3784200Z 2025-05-07T20:26:20.3784207Z 2025-05-07T20:26:20.3786110Z 2025-05-07T20:26:20.4445373Z cuda-nvvp-12.6.80 | 109.3 MB | #####8 | 59%  2025-05-07T20:26:20.4445967Z 2025-05-07T20:26:20.4445973Z 2025-05-07T20:26:20.4445979Z 2025-05-07T20:26:20.4445985Z 2025-05-07T20:26:20.4445991Z 2025-05-07T20:26:20.4445996Z 2025-05-07T20:26:20.4581608Z libcusolver-11.7.1.2 | 95.8 MB | #####3 | 54%  2025-05-07T20:26:20.4784034Z nsight-compute-2024. | 443.1 MB | #####9 | 60% 2025-05-07T20:26:20.4784347Z 2025-05-07T20:26:20.4784364Z 2025-05-07T20:26:20.4784368Z 2025-05-07T20:26:20.4784372Z 2025-05-07T20:26:20.4788036Z 2025-05-07T20:26:20.5444903Z cuda-nvvp-12.6.80 | 109.3 MB | ######1 | 62%  2025-05-07T20:26:20.5445205Z 2025-05-07T20:26:20.5445209Z 2025-05-07T20:26:20.5445212Z 2025-05-07T20:26:20.5445216Z 2025-05-07T20:26:20.5445220Z 2025-05-07T20:26:20.5445223Z 2025-05-07T20:26:20.5586475Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 57%  2025-05-07T20:26:20.5784267Z nsight-compute-2024. | 443.1 MB | ###### | 60% 2025-05-07T20:26:20.5784938Z 2025-05-07T20:26:20.5784954Z 2025-05-07T20:26:20.5784957Z 2025-05-07T20:26:20.5784961Z 2025-05-07T20:26:20.5786355Z 2025-05-07T20:26:20.6465102Z cuda-nvvp-12.6.80 | 109.3 MB | ######5 | 65%  2025-05-07T20:26:20.6465460Z 2025-05-07T20:26:20.6465467Z 2025-05-07T20:26:20.6465474Z 2025-05-07T20:26:20.6465481Z 2025-05-07T20:26:20.6465486Z 2025-05-07T20:26:20.6468151Z 2025-05-07T20:26:20.6588109Z libcusolver-11.7.1.2 | 95.8 MB | ###### | 61%  2025-05-07T20:26:20.6786160Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:26:20.6786586Z 2025-05-07T20:26:20.6786597Z 2025-05-07T20:26:20.6786605Z 2025-05-07T20:26:20.6786612Z 2025-05-07T20:26:20.6791433Z 2025-05-07T20:26:20.7466248Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 68%  2025-05-07T20:26:20.7466560Z 2025-05-07T20:26:20.7466564Z 2025-05-07T20:26:20.7466568Z 2025-05-07T20:26:20.7466572Z 2025-05-07T20:26:20.7466576Z 2025-05-07T20:26:20.7466812Z 2025-05-07T20:26:20.7591182Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:26:20.7791768Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:26:20.7792043Z 2025-05-07T20:26:20.7792047Z 2025-05-07T20:26:20.7792051Z 2025-05-07T20:26:20.7792055Z 2025-05-07T20:26:20.7796112Z 2025-05-07T20:26:20.8467835Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 72%  2025-05-07T20:26:20.8468433Z 2025-05-07T20:26:20.8468441Z 2025-05-07T20:26:20.8468448Z 2025-05-07T20:26:20.8468455Z 2025-05-07T20:26:20.8468462Z 2025-05-07T20:26:20.8470781Z 2025-05-07T20:26:20.8620716Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 68%  2025-05-07T20:26:20.8797520Z nsight-compute-2024. 
| 443.1 MB | ######2 | 63% 2025-05-07T20:26:20.8797838Z 2025-05-07T20:26:20.8797842Z 2025-05-07T20:26:20.8797847Z 2025-05-07T20:26:20.8797850Z 2025-05-07T20:26:20.8800426Z 2025-05-07T20:26:20.9469100Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 75%  2025-05-07T20:26:20.9469438Z 2025-05-07T20:26:20.9469456Z 2025-05-07T20:26:20.9469460Z 2025-05-07T20:26:20.9469463Z 2025-05-07T20:26:20.9469467Z 2025-05-07T20:26:20.9469471Z 2025-05-07T20:26:20.9622817Z libcusolver-11.7.1.2 | 95.8 MB | #######1 | 71%  2025-05-07T20:26:20.9801352Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:26:20.9801654Z 2025-05-07T20:26:20.9801659Z 2025-05-07T20:26:20.9801662Z 2025-05-07T20:26:20.9801666Z 2025-05-07T20:26:20.9801669Z 2025-05-07T20:26:21.0506759Z cuda-nvvp-12.6.80 | 109.3 MB | #######8 | 78%  2025-05-07T20:26:21.0507153Z 2025-05-07T20:26:21.0507160Z 2025-05-07T20:26:21.0507166Z 2025-05-07T20:26:21.0507173Z 2025-05-07T20:26:21.0507179Z 2025-05-07T20:26:21.0507199Z 2025-05-07T20:26:21.0681020Z libcusolver-11.7.1.2 | 95.8 MB | #######4 | 75%  2025-05-07T20:26:21.0807337Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:26:21.0807720Z 2025-05-07T20:26:21.0807726Z 2025-05-07T20:26:21.0807732Z 2025-05-07T20:26:21.0807739Z 2025-05-07T20:26:21.0810686Z 2025-05-07T20:26:21.1512330Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 81%  2025-05-07T20:26:21.1513021Z 2025-05-07T20:26:21.1513031Z 2025-05-07T20:26:21.1513042Z 2025-05-07T20:26:21.1513052Z 2025-05-07T20:26:21.1513060Z 2025-05-07T20:26:21.1513098Z 2025-05-07T20:26:21.1737428Z libcusolver-11.7.1.2 | 95.8 MB | #######8 | 78%  2025-05-07T20:26:21.1807607Z nsight-compute-2024. | 443.1 MB | ######4 | 65% 2025-05-07T20:26:21.1807874Z 2025-05-07T20:26:21.1807879Z 2025-05-07T20:26:21.1807883Z 2025-05-07T20:26:21.1807887Z 2025-05-07T20:26:21.1810128Z 2025-05-07T20:26:21.2560431Z cuda-nvvp-12.6.80 | 109.3 MB | ########4 | 85%  2025-05-07T20:26:21.2560747Z 2025-05-07T20:26:21.2560751Z 2025-05-07T20:26:21.2560755Z 2025-05-07T20:26:21.2560759Z 2025-05-07T20:26:21.2560762Z 2025-05-07T20:26:21.2561054Z 2025-05-07T20:26:21.2739210Z libcusolver-11.7.1.2 | 95.8 MB | ########1 | 82%  2025-05-07T20:26:21.2817040Z nsight-compute-2024. | 443.1 MB | ######5 | 66% 2025-05-07T20:26:21.2817293Z 2025-05-07T20:26:21.2817332Z 2025-05-07T20:26:21.2817336Z 2025-05-07T20:26:21.2817340Z 2025-05-07T20:26:21.2819417Z 2025-05-07T20:26:21.3576978Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:26:21.3577285Z 2025-05-07T20:26:21.3577289Z 2025-05-07T20:26:21.3577293Z 2025-05-07T20:26:21.3577296Z 2025-05-07T20:26:21.3577300Z 2025-05-07T20:26:21.3577339Z 2025-05-07T20:26:21.3749135Z libcusolver-11.7.1.2 | 95.8 MB | ########4 | 85%  2025-05-07T20:26:21.3864471Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:21.3864738Z 2025-05-07T20:26:21.3864742Z 2025-05-07T20:26:21.3864747Z 2025-05-07T20:26:21.3864751Z 2025-05-07T20:26:21.3866665Z 2025-05-07T20:26:21.4679470Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:26:21.4680073Z 2025-05-07T20:26:21.4680077Z 2025-05-07T20:26:21.4680081Z 2025-05-07T20:26:21.4680084Z 2025-05-07T20:26:21.4680088Z 2025-05-07T20:26:21.4680092Z 2025-05-07T20:26:21.4789118Z libcusolver-11.7.1.2 | 95.8 MB | ########8 | 88%  2025-05-07T20:26:21.4905115Z nsight-compute-2024. 
| 443.1 MB | ######6 | 67% 2025-05-07T20:26:21.4905808Z 2025-05-07T20:26:21.4905814Z 2025-05-07T20:26:21.4905817Z 2025-05-07T20:26:21.4905821Z 2025-05-07T20:26:21.4907568Z 2025-05-07T20:26:21.5738723Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 95%  2025-05-07T20:26:21.5739146Z 2025-05-07T20:26:21.5739153Z 2025-05-07T20:26:21.5739159Z 2025-05-07T20:26:21.5739164Z 2025-05-07T20:26:21.5739171Z 2025-05-07T20:26:21.5741463Z 2025-05-07T20:26:21.5788701Z libcusolver-11.7.1.2 | 95.8 MB | #########1 | 92%  2025-05-07T20:26:21.5910609Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:26:21.5911038Z 2025-05-07T20:26:21.5911045Z 2025-05-07T20:26:21.5911050Z 2025-05-07T20:26:21.5911056Z 2025-05-07T20:26:21.5912804Z 2025-05-07T20:26:21.6527409Z cuda-nvvp-12.6.80 | 109.3 MB | #########8 | 98%  2025-05-07T20:26:21.6527729Z 2025-05-07T20:26:21.6527736Z 2025-05-07T20:26:21.6527741Z 2025-05-07T20:26:21.6527759Z 2025-05-07T20:26:21.6747132Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:21.6747424Z 2025-05-07T20:26:21.6747430Z 2025-05-07T20:26:21.6747435Z 2025-05-07T20:26:21.6747440Z 2025-05-07T20:26:21.6747461Z 2025-05-07T20:26:21.6747464Z 2025-05-07T20:26:21.6788915Z libcusolver-11.7.1.2 | 95.8 MB | #########4 | 95%  2025-05-07T20:26:21.7417333Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:26:21.7417581Z 2025-05-07T20:26:21.7418796Z 2025-05-07T20:26:21.7753053Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:21.7753458Z 2025-05-07T20:26:21.7753494Z 2025-05-07T20:26:21.7753500Z 2025-05-07T20:26:21.7753506Z 2025-05-07T20:26:21.7753511Z 2025-05-07T20:26:21.7753518Z 2025-05-07T20:26:21.7789796Z libcusolver-11.7.1.2 | 95.8 MB | #########9 | 99%  2025-05-07T20:26:21.7805441Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:26:21.7805702Z 2025-05-07T20:26:21.7805706Z 2025-05-07T20:26:21.7805710Z 2025-05-07T20:26:21.7805727Z 2025-05-07T20:26:21.7805731Z 2025-05-07T20:26:21.7805735Z 2025-05-07T20:26:21.7805738Z 2025-05-07T20:26:21.8791843Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:26:21.8806592Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:21.8806917Z 2025-05-07T20:26:21.8806922Z 2025-05-07T20:26:21.8806927Z 2025-05-07T20:26:21.8806934Z 2025-05-07T20:26:21.8806939Z 2025-05-07T20:26:21.8806945Z 2025-05-07T20:26:21.8806950Z 2025-05-07T20:26:21.9793117Z libnpp-12.3.1.54 | 93.4 MB | 3 | 4%  2025-05-07T20:26:21.9808912Z nsight-compute-2024. | 443.1 MB | ####### | 71% 2025-05-07T20:26:21.9809183Z 2025-05-07T20:26:21.9809187Z 2025-05-07T20:26:21.9809190Z 2025-05-07T20:26:21.9809194Z 2025-05-07T20:26:21.9809198Z 2025-05-07T20:26:21.9809201Z 2025-05-07T20:26:21.9811744Z 2025-05-07T20:26:22.0795319Z libnpp-12.3.1.54 | 93.4 MB | 7 | 8%  2025-05-07T20:26:22.0813773Z nsight-compute-2024. | 443.1 MB | #######1 | 72% 2025-05-07T20:26:22.0814068Z 2025-05-07T20:26:22.0814073Z 2025-05-07T20:26:22.0814079Z 2025-05-07T20:26:22.0814085Z 2025-05-07T20:26:22.0814090Z 2025-05-07T20:26:22.0814097Z 2025-05-07T20:26:22.0814102Z 2025-05-07T20:26:22.1798394Z libnpp-12.3.1.54 | 93.4 MB | #1 | 11%  2025-05-07T20:26:22.1814941Z nsight-compute-2024. 
| 443.1 MB | #######2 | 73% 2025-05-07T20:26:22.1815208Z 2025-05-07T20:26:22.1815224Z 2025-05-07T20:26:22.1815228Z 2025-05-07T20:26:22.1815232Z 2025-05-07T20:26:22.1815236Z 2025-05-07T20:26:22.1815264Z 2025-05-07T20:26:22.1815269Z 2025-05-07T20:26:22.2816523Z libnpp-12.3.1.54 | 93.4 MB | #5 | 15%  2025-05-07T20:26:22.2816835Z 2025-05-07T20:26:22.2816839Z 2025-05-07T20:26:22.2816843Z 2025-05-07T20:26:22.2816846Z 2025-05-07T20:26:22.2816850Z 2025-05-07T20:26:22.2816854Z 2025-05-07T20:26:22.2816858Z 2025-05-07T20:26:22.2904265Z libnpp-12.3.1.54 | 93.4 MB | #8 | 19%  2025-05-07T20:26:22.3816904Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:26:22.3817220Z 2025-05-07T20:26:22.3817232Z 2025-05-07T20:26:22.3817236Z 2025-05-07T20:26:22.3817240Z 2025-05-07T20:26:22.3817243Z 2025-05-07T20:26:22.3817247Z 2025-05-07T20:26:22.3817250Z 2025-05-07T20:26:22.3976053Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:26:22.4831618Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:22.4832033Z 2025-05-07T20:26:22.4832039Z 2025-05-07T20:26:22.4832078Z 2025-05-07T20:26:22.4832083Z 2025-05-07T20:26:22.4832087Z 2025-05-07T20:26:22.4832093Z 2025-05-07T20:26:22.4832098Z 2025-05-07T20:26:22.5018563Z libnpp-12.3.1.54 | 93.4 MB | ##6 | 27%  2025-05-07T20:26:22.5833887Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:22.5834453Z 2025-05-07T20:26:22.5834462Z 2025-05-07T20:26:22.5834470Z 2025-05-07T20:26:22.5834514Z 2025-05-07T20:26:22.5834524Z 2025-05-07T20:26:22.5834533Z 2025-05-07T20:26:22.5834542Z 2025-05-07T20:26:22.6077590Z libnpp-12.3.1.54 | 93.4 MB | ### | 30%  2025-05-07T20:26:22.6841736Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:26:22.6842166Z 2025-05-07T20:26:22.6842171Z 2025-05-07T20:26:22.6842179Z 2025-05-07T20:26:22.6842183Z 2025-05-07T20:26:22.6842197Z 2025-05-07T20:26:22.6842200Z 2025-05-07T20:26:22.6842205Z 2025-05-07T20:26:22.7080071Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 34%  2025-05-07T20:26:22.7903905Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:22.7904226Z 2025-05-07T20:26:22.7904237Z 2025-05-07T20:26:22.7904244Z 2025-05-07T20:26:22.7904253Z 2025-05-07T20:26:22.7904262Z 2025-05-07T20:26:22.7904273Z 2025-05-07T20:26:22.7904281Z 2025-05-07T20:26:22.8082502Z libnpp-12.3.1.54 | 93.4 MB | ###8 | 38%  2025-05-07T20:26:22.8906272Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:22.8906539Z 2025-05-07T20:26:22.8906543Z 2025-05-07T20:26:22.8906547Z 2025-05-07T20:26:22.8906551Z 2025-05-07T20:26:22.8906555Z 2025-05-07T20:26:22.8906558Z 2025-05-07T20:26:22.8907526Z 2025-05-07T20:26:22.9082785Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 42%  2025-05-07T20:26:22.9948516Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:26:22.9948793Z 2025-05-07T20:26:22.9948797Z 2025-05-07T20:26:22.9948800Z 2025-05-07T20:26:22.9948805Z 2025-05-07T20:26:22.9949062Z 2025-05-07T20:26:22.9949066Z 2025-05-07T20:26:22.9951257Z 2025-05-07T20:26:23.0123908Z libnpp-12.3.1.54 | 93.4 MB | ####5 | 46%  2025-05-07T20:26:23.0951077Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:26:23.0951372Z 2025-05-07T20:26:23.0951376Z 2025-05-07T20:26:23.0951380Z 2025-05-07T20:26:23.0951386Z 2025-05-07T20:26:23.0951394Z 2025-05-07T20:26:23.0951438Z 2025-05-07T20:26:23.0951445Z 2025-05-07T20:26:23.1131493Z libnpp-12.3.1.54 | 93.4 MB | ####9 | 50%  2025-05-07T20:26:23.1997927Z nsight-compute-2024. 
| 443.1 MB | #######9 | 80% 2025-05-07T20:26:23.1998220Z 2025-05-07T20:26:23.1998224Z 2025-05-07T20:26:23.1998227Z 2025-05-07T20:26:23.1998231Z 2025-05-07T20:26:23.1998235Z 2025-05-07T20:26:23.1998238Z 2025-05-07T20:26:23.2002343Z 2025-05-07T20:26:23.2225142Z libnpp-12.3.1.54 | 93.4 MB | #####3 | 53%  2025-05-07T20:26:23.3045921Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:26:23.3046269Z 2025-05-07T20:26:23.3046273Z 2025-05-07T20:26:23.3046278Z 2025-05-07T20:26:23.3046281Z 2025-05-07T20:26:23.3046285Z 2025-05-07T20:26:23.3046290Z 2025-05-07T20:26:23.3046294Z 2025-05-07T20:26:23.3289739Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 57%  2025-05-07T20:26:23.4080764Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:26:23.4081115Z 2025-05-07T20:26:23.4081124Z 2025-05-07T20:26:23.4081132Z 2025-05-07T20:26:23.4081139Z 2025-05-07T20:26:23.4081148Z 2025-05-07T20:26:23.4081156Z 2025-05-07T20:26:23.4083844Z 2025-05-07T20:26:23.4345297Z libnpp-12.3.1.54 | 93.4 MB | ###### | 61%  2025-05-07T20:26:23.5081066Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:23.5081334Z 2025-05-07T20:26:23.5081338Z 2025-05-07T20:26:23.5081343Z 2025-05-07T20:26:23.5081346Z 2025-05-07T20:26:23.5081351Z 2025-05-07T20:26:23.5081356Z 2025-05-07T20:26:23.5082646Z 2025-05-07T20:26:23.5348758Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 65%  2025-05-07T20:26:23.6083718Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:26:23.6084060Z 2025-05-07T20:26:23.6084068Z 2025-05-07T20:26:23.6084076Z 2025-05-07T20:26:23.6084083Z 2025-05-07T20:26:23.6084091Z 2025-05-07T20:26:23.6084097Z 2025-05-07T20:26:23.6084135Z 2025-05-07T20:26:23.6356928Z libnpp-12.3.1.54 | 93.4 MB | ######8 | 69%  2025-05-07T20:26:23.7154980Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:23.7155324Z 2025-05-07T20:26:23.7155331Z 2025-05-07T20:26:23.7155337Z 2025-05-07T20:26:23.7155342Z 2025-05-07T20:26:23.7155347Z 2025-05-07T20:26:23.7155353Z 2025-05-07T20:26:23.7155361Z 2025-05-07T20:26:23.7361701Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 72%  2025-05-07T20:26:23.8159277Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:26:23.8159633Z 2025-05-07T20:26:23.8159637Z 2025-05-07T20:26:23.8159640Z 2025-05-07T20:26:23.8159644Z 2025-05-07T20:26:23.8159647Z 2025-05-07T20:26:23.8159651Z 2025-05-07T20:26:23.8159655Z 2025-05-07T20:26:23.8368390Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 76%  2025-05-07T20:26:23.9161955Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:26:23.9162506Z 2025-05-07T20:26:23.9162553Z 2025-05-07T20:26:23.9162573Z 2025-05-07T20:26:23.9162581Z 2025-05-07T20:26:23.9162589Z 2025-05-07T20:26:23.9162598Z 2025-05-07T20:26:23.9162606Z 2025-05-07T20:26:24.0162358Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:26:24.0162795Z 2025-05-07T20:26:24.0162799Z 2025-05-07T20:26:24.0162803Z 2025-05-07T20:26:24.0162807Z 2025-05-07T20:26:24.0162810Z 2025-05-07T20:26:24.0162814Z 2025-05-07T20:26:24.0164159Z 2025-05-07T20:26:24.0224244Z libnpp-12.3.1.54 | 93.4 MB | ########5 | 85%  2025-05-07T20:26:24.1173517Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:24.1173952Z 2025-05-07T20:26:24.1173956Z 2025-05-07T20:26:24.1173960Z 2025-05-07T20:26:24.1173964Z 2025-05-07T20:26:24.1173970Z 2025-05-07T20:26:24.1173974Z 2025-05-07T20:26:24.1173978Z 2025-05-07T20:26:24.1248937Z libnpp-12.3.1.54 | 93.4 MB | ########9 | 90%  2025-05-07T20:26:24.2252447Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:24.2285446Z nsight-compute-2024. 
| 443.1 MB | ########7 | 88% 2025-05-07T20:26:24.2285807Z 2025-05-07T20:26:24.2285813Z 2025-05-07T20:26:24.2285819Z 2025-05-07T20:26:24.2285825Z 2025-05-07T20:26:24.2285829Z 2025-05-07T20:26:24.2285835Z 2025-05-07T20:26:24.2287074Z 2025-05-07T20:26:24.3253198Z libnpp-12.3.1.54 | 93.4 MB | #########3 | 94%  2025-05-07T20:26:24.3319398Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:24.3319720Z 2025-05-07T20:26:24.3319725Z 2025-05-07T20:26:24.3319728Z 2025-05-07T20:26:24.3319757Z 2025-05-07T20:26:24.3319761Z 2025-05-07T20:26:24.3319764Z 2025-05-07T20:26:24.3320044Z 2025-05-07T20:26:24.4259197Z libnpp-12.3.1.54 | 93.4 MB | #########7 | 98%  2025-05-07T20:26:24.5258469Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:24.6262535Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:24.7135935Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:24.7136283Z 2025-05-07T20:26:24.7136289Z 2025-05-07T20:26:24.7136294Z 2025-05-07T20:26:24.7136318Z 2025-05-07T20:26:24.7136326Z 2025-05-07T20:26:24.7139979Z 2025-05-07T20:26:24.7341775Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:24.7600355Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:26:24.7600617Z 2025-05-07T20:26:24.7600621Z 2025-05-07T20:26:24.7600625Z 2025-05-07T20:26:24.7600629Z 2025-05-07T20:26:24.7600632Z 2025-05-07T20:26:24.7600636Z 2025-05-07T20:26:24.7600665Z 2025-05-07T20:26:24.7606509Z 2025-05-07T20:26:24.7809021Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:24.7809342Z 2025-05-07T20:26:24.7809346Z 2025-05-07T20:26:24.7809350Z 2025-05-07T20:26:24.7809353Z 2025-05-07T20:26:24.7811178Z 2025-05-07T20:26:24.8230713Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:24.8231044Z 2025-05-07T20:26:24.8231048Z 2025-05-07T20:26:24.8231052Z 2025-05-07T20:26:24.8231055Z 2025-05-07T20:26:24.8231059Z 2025-05-07T20:26:24.8231063Z 2025-05-07T20:26:24.8231066Z 2025-05-07T20:26:24.8231070Z 2025-05-07T20:26:24.8233500Z 2025-05-07T20:26:24.8537791Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:24.8600917Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:26:24.8601192Z 2025-05-07T20:26:24.8601285Z 2025-05-07T20:26:24.8601288Z 2025-05-07T20:26:24.8601463Z 2025-05-07T20:26:24.8601473Z 2025-05-07T20:26:24.8601524Z 2025-05-07T20:26:24.8601534Z 2025-05-07T20:26:24.8606586Z 2025-05-07T20:26:24.9233825Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:24.9234293Z 2025-05-07T20:26:24.9234298Z 2025-05-07T20:26:24.9234301Z 2025-05-07T20:26:24.9234305Z 2025-05-07T20:26:24.9234309Z 2025-05-07T20:26:24.9234313Z 2025-05-07T20:26:24.9234316Z 2025-05-07T20:26:24.9234345Z 2025-05-07T20:26:24.9235075Z 2025-05-07T20:26:24.9762165Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:24.9762505Z 2025-05-07T20:26:24.9762509Z 2025-05-07T20:26:24.9762513Z 2025-05-07T20:26:24.9762517Z 2025-05-07T20:26:24.9762520Z 2025-05-07T20:26:24.9762525Z 2025-05-07T20:26:24.9762529Z 2025-05-07T20:26:24.9762533Z 2025-05-07T20:26:24.9943159Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:26:25.0403508Z nsight-compute-2024. 
| 443.1 MB | #########4 | 94% 2025-05-07T20:26:25.0404069Z 2025-05-07T20:26:25.0404073Z 2025-05-07T20:26:25.0404077Z 2025-05-07T20:26:25.0404081Z 2025-05-07T20:26:25.0404088Z 2025-05-07T20:26:25.0404092Z 2025-05-07T20:26:25.0404097Z 2025-05-07T20:26:25.0404113Z 2025-05-07T20:26:25.0405961Z 2025-05-07T20:26:25.0821237Z libcurand-10.3.7.77 | 39.9 MB | #4 | 15%  2025-05-07T20:26:25.0821532Z 2025-05-07T20:26:25.0821537Z 2025-05-07T20:26:25.0821566Z 2025-05-07T20:26:25.0821580Z 2025-05-07T20:26:25.0821584Z 2025-05-07T20:26:25.0821587Z 2025-05-07T20:26:25.0821591Z 2025-05-07T20:26:25.0821598Z 2025-05-07T20:26:25.1314319Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:26:25.1506152Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:25.1506521Z 2025-05-07T20:26:25.1506528Z 2025-05-07T20:26:25.1506533Z 2025-05-07T20:26:25.1506538Z 2025-05-07T20:26:25.1506543Z 2025-05-07T20:26:25.1506559Z 2025-05-07T20:26:25.1506564Z 2025-05-07T20:26:25.1506601Z 2025-05-07T20:26:25.1506606Z 2025-05-07T20:26:25.1976450Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:26:25.1976769Z 2025-05-07T20:26:25.1976773Z 2025-05-07T20:26:25.1976776Z 2025-05-07T20:26:25.1976780Z 2025-05-07T20:26:25.1976783Z 2025-05-07T20:26:25.1976787Z 2025-05-07T20:26:25.1976790Z 2025-05-07T20:26:25.1978667Z 2025-05-07T20:26:25.2513704Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 25%  2025-05-07T20:26:25.2514055Z 2025-05-07T20:26:25.2514059Z 2025-05-07T20:26:25.2514071Z 2025-05-07T20:26:25.2514076Z 2025-05-07T20:26:25.2514080Z 2025-05-07T20:26:25.2514084Z 2025-05-07T20:26:25.2514088Z 2025-05-07T20:26:25.2514093Z 2025-05-07T20:26:25.2515987Z 2025-05-07T20:26:25.2530712Z libcurand-10.3.7.77 | 39.9 MB | ##9 | 30%  2025-05-07T20:26:25.2978302Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:25.2978682Z 2025-05-07T20:26:25.2978776Z 2025-05-07T20:26:25.2978810Z 2025-05-07T20:26:25.2978814Z 2025-05-07T20:26:25.2978818Z 2025-05-07T20:26:25.2978824Z 2025-05-07T20:26:25.2978828Z 2025-05-07T20:26:25.2978923Z 2025-05-07T20:26:25.3550628Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###1 | 31%  2025-05-07T20:26:25.3551127Z 2025-05-07T20:26:25.3551131Z 2025-05-07T20:26:25.3551135Z 2025-05-07T20:26:25.3551139Z 2025-05-07T20:26:25.3551173Z 2025-05-07T20:26:25.3551177Z 2025-05-07T20:26:25.3551180Z 2025-05-07T20:26:25.3551184Z 2025-05-07T20:26:25.3551628Z 2025-05-07T20:26:25.3597449Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 37%  2025-05-07T20:26:25.3982286Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:26:25.3982710Z 2025-05-07T20:26:25.3982717Z 2025-05-07T20:26:25.3982726Z 2025-05-07T20:26:25.3982732Z 2025-05-07T20:26:25.3982738Z 2025-05-07T20:26:25.3982745Z 2025-05-07T20:26:25.3982753Z 2025-05-07T20:26:25.3984779Z 2025-05-07T20:26:25.4550500Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###7 | 38%  2025-05-07T20:26:25.4551066Z 2025-05-07T20:26:25.4551073Z 2025-05-07T20:26:25.4551079Z 2025-05-07T20:26:25.4551086Z 2025-05-07T20:26:25.4551092Z 2025-05-07T20:26:25.4551098Z 2025-05-07T20:26:25.4551117Z 2025-05-07T20:26:25.4551123Z 2025-05-07T20:26:25.4551130Z 2025-05-07T20:26:25.4683399Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 44%  2025-05-07T20:26:25.4988378Z nsight-compute-2024. 
| 443.1 MB | #########7 | 97% 2025-05-07T20:26:25.4988736Z 2025-05-07T20:26:25.4988742Z 2025-05-07T20:26:25.4988747Z 2025-05-07T20:26:25.4988755Z 2025-05-07T20:26:25.4988760Z 2025-05-07T20:26:25.4988766Z 2025-05-07T20:26:25.4988773Z 2025-05-07T20:26:25.4990648Z 2025-05-07T20:26:25.5615642Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####3 | 44%  2025-05-07T20:26:25.5616122Z 2025-05-07T20:26:25.5616126Z 2025-05-07T20:26:25.5616130Z 2025-05-07T20:26:25.5616133Z 2025-05-07T20:26:25.5616137Z 2025-05-07T20:26:25.5616428Z 2025-05-07T20:26:25.5616432Z 2025-05-07T20:26:25.5616436Z 2025-05-07T20:26:25.5616444Z 2025-05-07T20:26:25.5688966Z libcurand-10.3.7.77 | 39.9 MB | #####1 | 52%  2025-05-07T20:26:25.6035188Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:25.6035543Z 2025-05-07T20:26:25.6035549Z 2025-05-07T20:26:25.6035554Z 2025-05-07T20:26:25.6035592Z 2025-05-07T20:26:25.6035597Z 2025-05-07T20:26:25.6035602Z 2025-05-07T20:26:25.6035607Z 2025-05-07T20:26:25.6038835Z 2025-05-07T20:26:25.6645456Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####9 | 50%  2025-05-07T20:26:25.6645860Z 2025-05-07T20:26:25.6645865Z 2025-05-07T20:26:25.6645870Z 2025-05-07T20:26:25.6645894Z 2025-05-07T20:26:25.6645901Z 2025-05-07T20:26:25.6645907Z 2025-05-07T20:26:25.6645915Z 2025-05-07T20:26:25.6645919Z 2025-05-07T20:26:25.6646635Z 2025-05-07T20:26:25.6692623Z libcurand-10.3.7.77 | 39.9 MB | #####8 | 59%  2025-05-07T20:26:25.7037030Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:26:25.7037427Z 2025-05-07T20:26:25.7037444Z 2025-05-07T20:26:25.7037450Z 2025-05-07T20:26:25.7037456Z 2025-05-07T20:26:25.7037461Z 2025-05-07T20:26:25.7037467Z 2025-05-07T20:26:25.7037472Z 2025-05-07T20:26:25.7044714Z 2025-05-07T20:26:25.7647523Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####5 | 56%  2025-05-07T20:26:25.7647941Z 2025-05-07T20:26:25.7647946Z 2025-05-07T20:26:25.7647950Z 2025-05-07T20:26:25.7647954Z 2025-05-07T20:26:25.7647959Z 2025-05-07T20:26:25.7647962Z 2025-05-07T20:26:25.7647966Z 2025-05-07T20:26:25.7647970Z 2025-05-07T20:26:25.7647973Z 2025-05-07T20:26:25.7711483Z libcurand-10.3.7.77 | 39.9 MB | ######5 | 66%  2025-05-07T20:26:25.8070916Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:26:25.8071283Z 2025-05-07T20:26:25.8071287Z 2025-05-07T20:26:25.8071292Z 2025-05-07T20:26:25.8071323Z 2025-05-07T20:26:25.8071327Z 2025-05-07T20:26:25.8071330Z 2025-05-07T20:26:25.8071334Z 2025-05-07T20:26:25.8075200Z 2025-05-07T20:26:25.8648427Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 62%  2025-05-07T20:26:25.8648766Z 2025-05-07T20:26:25.8648770Z 2025-05-07T20:26:25.8648773Z 2025-05-07T20:26:25.8648777Z 2025-05-07T20:26:25.8648781Z 2025-05-07T20:26:25.8648785Z 2025-05-07T20:26:25.8648969Z 2025-05-07T20:26:25.8648973Z 2025-05-07T20:26:25.8650108Z 2025-05-07T20:26:25.8823383Z libcurand-10.3.7.77 | 39.9 MB | #######2 | 73%  2025-05-07T20:26:25.9221938Z nsight-compute-2024. 
| 443.1 MB | #########9 | 100% 2025-05-07T20:26:25.9222231Z 2025-05-07T20:26:25.9222236Z 2025-05-07T20:26:25.9222242Z 2025-05-07T20:26:25.9222247Z 2025-05-07T20:26:25.9222251Z 2025-05-07T20:26:25.9222256Z 2025-05-07T20:26:25.9222261Z 2025-05-07T20:26:25.9222265Z 2025-05-07T20:26:26.0229984Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######7 | 68%  2025-05-07T20:26:26.0230411Z 2025-05-07T20:26:26.0230420Z 2025-05-07T20:26:26.0230429Z 2025-05-07T20:26:26.0230433Z 2025-05-07T20:26:26.0230438Z 2025-05-07T20:26:26.0230443Z 2025-05-07T20:26:26.0230449Z 2025-05-07T20:26:26.0231721Z 2025-05-07T20:26:26.1233330Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######4 | 75%  2025-05-07T20:26:26.1234010Z 2025-05-07T20:26:26.1234059Z 2025-05-07T20:26:26.1234065Z 2025-05-07T20:26:26.1234083Z 2025-05-07T20:26:26.1234088Z 2025-05-07T20:26:26.1234093Z 2025-05-07T20:26:26.1234098Z 2025-05-07T20:26:26.1234102Z 2025-05-07T20:26:26.1316689Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########2 | 83%  2025-05-07T20:26:26.1317011Z 2025-05-07T20:26:26.1317015Z 2025-05-07T20:26:26.1317019Z 2025-05-07T20:26:26.1317022Z 2025-05-07T20:26:26.1317026Z 2025-05-07T20:26:26.1317030Z 2025-05-07T20:26:26.1317033Z 2025-05-07T20:26:26.1317037Z 2025-05-07T20:26:26.1319107Z 2025-05-07T20:26:26.2240313Z libcurand-10.3.7.77 | 39.9 MB | #######9 | 80%  2025-05-07T20:26:26.2241096Z 2025-05-07T20:26:26.2241101Z 2025-05-07T20:26:26.2241105Z 2025-05-07T20:26:26.2241108Z 2025-05-07T20:26:26.2241112Z 2025-05-07T20:26:26.2241116Z 2025-05-07T20:26:26.2241119Z 2025-05-07T20:26:26.2241123Z 2025-05-07T20:26:26.2322548Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########9 | 89%  2025-05-07T20:26:26.2322985Z 2025-05-07T20:26:26.2322990Z 2025-05-07T20:26:26.2322993Z 2025-05-07T20:26:26.2322997Z 2025-05-07T20:26:26.2323001Z 2025-05-07T20:26:26.2323004Z 2025-05-07T20:26:26.2323008Z 2025-05-07T20:26:26.2323011Z 2025-05-07T20:26:26.2323015Z 2025-05-07T20:26:26.3325591Z libcurand-10.3.7.77 | 39.9 MB | ########7 | 88%  2025-05-07T20:26:26.3325942Z 2025-05-07T20:26:26.3325948Z 2025-05-07T20:26:26.3325953Z 2025-05-07T20:26:26.3325958Z 2025-05-07T20:26:26.3325964Z 2025-05-07T20:26:26.3325971Z 2025-05-07T20:26:26.3325978Z 2025-05-07T20:26:26.3326024Z 2025-05-07T20:26:26.3327741Z 2025-05-07T20:26:26.3336157Z libcurand-10.3.7.77 | 39.9 MB | #########5 | 96%  2025-05-07T20:26:26.3336548Z 2025-05-07T20:26:26.3336552Z 2025-05-07T20:26:26.3336556Z 2025-05-07T20:26:26.3336571Z 2025-05-07T20:26:26.3336575Z 2025-05-07T20:26:26.3336579Z 2025-05-07T20:26:26.3336583Z 2025-05-07T20:26:26.3336586Z 2025-05-07T20:26:26.7345441Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:26.7345771Z 2025-05-07T20:26:26.7345775Z 2025-05-07T20:26:26.7345779Z 2025-05-07T20:26:27.5820730Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:27.5821036Z 2025-05-07T20:26:27.5821040Z 2025-05-07T20:26:27.5821045Z 2025-05-07T20:26:27.5821048Z 2025-05-07T20:26:27.5821061Z 2025-05-07T20:26:27.5821065Z 2025-05-07T20:26:27.5821069Z 2025-05-07T20:26:27.5821072Z 2025-05-07T20:26:27.5824230Z 2025-05-07T20:26:27.6233144Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:27.6233487Z 2025-05-07T20:26:27.6233491Z 2025-05-07T20:26:27.6233494Z 2025-05-07T20:26:27.6233498Z 2025-05-07T20:26:27.6233502Z 2025-05-07T20:26:27.6233505Z 2025-05-07T20:26:27.6233509Z 2025-05-07T20:26:27.6233513Z 2025-05-07T20:26:27.6233516Z 2025-05-07T20:26:27.6235240Z 2025-05-07T20:26:27.7234913Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:27.7235301Z 2025-05-07T20:26:27.7235309Z 2025-05-07T20:26:27.7235315Z 
2025-05-07T20:26:27.7820126Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:28.1875532Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:29.2654785Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:29.6675493Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:26:29.9105790Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:29.9822831Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:30.3832044Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:30.4405766Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:30.8014335Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:30.9596820Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:31.0018308Z ... (more hidden) ...
2025-05-07T20:26:31.0599797Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:31.1957275Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:31.6978370Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:32.1658223Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:32.7660722Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:33.7802693Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:33.8093780Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:42.8772720Z 2025-05-07T20:26:42.8772724Z 2025-05-07T20:26:42.8772727Z 2025-05-07T20:26:42.8772824Z 2025-05-07T20:26:42.8772829Z 2025-05-07T20:26:42.8772846Z 2025-05-07T20:26:42.8772849Z 2025-05-07T20:26:42.8772984Z  2025-05-07T20:26:42.8773151Z 2025-05-07T20:26:42.8773155Z 2025-05-07T20:26:42.8773158Z 2025-05-07T20:26:42.8773162Z 2025-05-07T20:26:42.8773165Z 2025-05-07T20:26:42.8773169Z 2025-05-07T20:26:42.8773185Z 2025-05-07T20:26:42.8773188Z 2025-05-07T20:26:42.8773192Z 2025-05-07T20:26:42.8773195Z 2025-05-07T20:26:42.8773199Z 2025-05-07T20:26:42.8773335Z  2025-05-07T20:26:42.8773512Z 2025-05-07T20:26:42.8773521Z 2025-05-07T20:26:42.8773538Z 2025-05-07T20:26:42.8773542Z 2025-05-07T20:26:42.8773545Z 2025-05-07T20:26:42.8773549Z 2025-05-07T20:26:42.8773552Z 2025-05-07T20:26:42.8773556Z 2025-05-07T20:26:42.8773559Z 2025-05-07T20:26:42.8773563Z 2025-05-07T20:26:42.8773567Z 2025-05-07T20:26:42.8773570Z 2025-05-07T20:26:42.8773710Z  2025-05-07T20:26:42.8773906Z 2025-05-07T20:26:42.8773914Z 2025-05-07T20:26:42.8773918Z 2025-05-07T20:26:42.8773921Z 2025-05-07T20:26:42.8773925Z 2025-05-07T20:26:42.8773929Z 2025-05-07T20:26:42.8773932Z 2025-05-07T20:26:42.8773936Z 2025-05-07T20:26:42.8773939Z 2025-05-07T20:26:42.8773943Z 2025-05-07T20:26:42.8773946Z 2025-05-07T20:26:42.8773950Z 2025-05-07T20:26:42.8773953Z 2025-05-07T20:26:42.8774091Z  2025-05-07T20:26:42.8774289Z 2025-05-07T20:26:42.8774292Z 2025-05-07T20:26:42.8774296Z 2025-05-07T20:26:42.8774300Z 2025-05-07T20:26:42.8774303Z 2025-05-07T20:26:42.8774307Z 2025-05-07T20:26:42.8774316Z 2025-05-07T20:26:42.8774320Z 2025-05-07T20:26:42.8774323Z 2025-05-07T20:26:42.8774327Z 2025-05-07T20:26:42.8774330Z 2025-05-07T20:26:42.8774334Z 2025-05-07T20:26:42.8774337Z 2025-05-07T20:26:42.8774341Z 2025-05-07T20:26:42.8774510Z  2025-05-07T20:26:42.8774704Z 2025-05-07T20:26:42.8774708Z 2025-05-07T20:26:42.8774712Z 2025-05-07T20:26:42.8774722Z 2025-05-07T20:26:42.8774725Z 2025-05-07T20:26:42.8774729Z 2025-05-07T20:26:42.8774733Z 2025-05-07T20:26:42.8774736Z 2025-05-07T20:26:42.8774740Z 2025-05-07T20:26:42.8774754Z 2025-05-07T20:26:42.8774757Z 2025-05-07T20:26:42.8774761Z 2025-05-07T20:26:42.8774764Z 2025-05-07T20:26:42.8774768Z 2025-05-07T20:26:42.8774771Z 2025-05-07T20:26:42.8774925Z  2025-05-07T20:26:42.8775127Z 2025-05-07T20:26:42.8775143Z 2025-05-07T20:26:42.8775147Z 2025-05-07T20:26:42.8775151Z 2025-05-07T20:26:42.8775154Z 2025-05-07T20:26:42.8775158Z 2025-05-07T20:26:42.8775245Z 2025-05-07T20:26:42.8775249Z 2025-05-07T20:26:42.8775252Z 2025-05-07T20:26:42.8775256Z 2025-05-07T20:26:42.8775259Z 2025-05-07T20:26:42.8775263Z 2025-05-07T20:26:42.8775266Z 2025-05-07T20:26:42.8775270Z 2025-05-07T20:26:42.8775273Z 2025-05-07T20:26:42.8775277Z 2025-05-07T20:26:42.8775435Z  2025-05-07T20:26:42.8775655Z 2025-05-07T20:26:42.8775664Z 2025-05-07T20:26:42.8775667Z 2025-05-07T20:26:42.8775671Z 2025-05-07T20:26:42.8775674Z 2025-05-07T20:26:42.8775678Z 2025-05-07T20:26:42.8775682Z 2025-05-07T20:26:42.8775685Z 2025-05-07T20:26:42.8775689Z 2025-05-07T20:26:42.8775692Z 2025-05-07T20:26:42.8775696Z 2025-05-07T20:26:42.8775699Z 2025-05-07T20:26:42.8775703Z 2025-05-07T20:26:42.8775706Z 2025-05-07T20:26:42.8775710Z 2025-05-07T20:26:42.8775713Z 2025-05-07T20:26:42.8775729Z 2025-05-07T20:26:42.8775888Z  2025-05-07T20:26:42.8776097Z 2025-05-07T20:26:42.8776100Z 2025-05-07T20:26:42.8776111Z 2025-05-07T20:26:42.8776115Z 2025-05-07T20:26:42.8776118Z 2025-05-07T20:26:42.8776122Z 2025-05-07T20:26:42.8776125Z 2025-05-07T20:26:42.8776141Z 2025-05-07T20:26:42.8776144Z 
2025-05-07T20:26:42.8776148Z 2025-05-07T20:26:42.8776151Z 2025-05-07T20:26:42.8776155Z 2025-05-07T20:26:42.8776158Z 2025-05-07T20:26:42.8776162Z 2025-05-07T20:26:42.8776165Z 2025-05-07T20:26:42.8776169Z 2025-05-07T20:26:42.8776274Z 2025-05-07T20:26:42.8776278Z 2025-05-07T20:26:42.8776451Z  2025-05-07T20:26:42.8776676Z 2025-05-07T20:26:42.8776679Z 2025-05-07T20:26:42.8776781Z  2025-05-07T20:26:42.8776892Z 2025-05-07T20:26:42.8776896Z 2025-05-07T20:26:42.8777015Z  2025-05-07T20:26:42.8777128Z 2025-05-07T20:26:42.8777131Z 2025-05-07T20:26:42.8777135Z 2025-05-07T20:26:42.8777253Z  2025-05-07T20:26:42.8777368Z 2025-05-07T20:26:42.8777371Z 2025-05-07T20:26:42.8777375Z 2025-05-07T20:26:42.8777378Z 2025-05-07T20:26:42.8777493Z  2025-05-07T20:26:42.8777634Z 2025-05-07T20:26:42.8777638Z 2025-05-07T20:26:42.8777641Z 2025-05-07T20:26:42.8777645Z 2025-05-07T20:26:42.8777648Z 2025-05-07T20:26:42.8777760Z  2025-05-07T20:26:42.8777902Z 2025-05-07T20:26:42.8777906Z 2025-05-07T20:26:42.8777909Z 2025-05-07T20:26:42.8777913Z 2025-05-07T20:26:42.8777916Z 2025-05-07T20:26:42.8777920Z 2025-05-07T20:26:42.8778043Z  2025-05-07T20:26:42.8778194Z 2025-05-07T20:26:42.8778198Z 2025-05-07T20:26:42.8778201Z 2025-05-07T20:26:42.8778205Z 2025-05-07T20:26:42.8778208Z 2025-05-07T20:26:42.8778212Z 2025-05-07T20:26:42.8778216Z 2025-05-07T20:26:42.8778343Z  2025-05-07T20:26:42.8778502Z 2025-05-07T20:26:42.8778506Z 2025-05-07T20:26:42.8778510Z 2025-05-07T20:26:42.8778513Z 2025-05-07T20:26:42.8778517Z 2025-05-07T20:26:42.8778520Z 2025-05-07T20:26:42.8778524Z 2025-05-07T20:26:42.8778527Z 2025-05-07T20:26:42.8778652Z  2025-05-07T20:26:42.8778828Z 2025-05-07T20:26:42.8778837Z 2025-05-07T20:26:42.8778841Z 2025-05-07T20:26:42.8778844Z 2025-05-07T20:26:42.8778848Z 2025-05-07T20:26:42.8778851Z 2025-05-07T20:26:42.8778855Z 2025-05-07T20:26:42.8778858Z 2025-05-07T20:26:42.8778862Z 2025-05-07T20:26:42.8778987Z  2025-05-07T20:26:42.8779164Z 2025-05-07T20:26:42.8779167Z 2025-05-07T20:26:42.8779171Z 2025-05-07T20:26:42.8779174Z 2025-05-07T20:26:42.8779183Z 2025-05-07T20:26:42.8779186Z 2025-05-07T20:26:42.8779190Z 2025-05-07T20:26:42.8779193Z 2025-05-07T20:26:42.8779197Z 2025-05-07T20:26:42.8779200Z 2025-05-07T20:26:42.8779338Z  2025-05-07T20:26:42.8779523Z 2025-05-07T20:26:42.8779526Z 2025-05-07T20:26:42.8779530Z 2025-05-07T20:26:42.8779533Z 2025-05-07T20:26:42.8779537Z 2025-05-07T20:26:42.8779540Z 2025-05-07T20:26:42.8779544Z 2025-05-07T20:26:42.8779548Z 2025-05-07T20:26:42.8779551Z 2025-05-07T20:26:42.8779554Z 2025-05-07T20:26:42.8779558Z 2025-05-07T20:26:42.8779718Z  done 2025-05-07T20:26:43.1891488Z Preparing transaction: \ | / done 2025-05-07T20:26:45.4276800Z Verifying transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:46.4358007Z Executing transaction: / - \ | / - \ | / - done 2025-05-07T20:26:49.1317473Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ... 2025-05-07T20:26:49.1317877Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:49.1318587Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:49.1319144Z 2025-05-07T20:26:49.1333485Z 2025-05-07T20:26:49.1334620Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:49.1335325Z 2025-05-07T20:26:49.1347095Z 2025-05-07T20:26:49.1347381Z [INSTALL] Copying nvtx3 headers ... 
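[NOTE] The two `ln -sf` calls above restore the unversioned libnvToolsExt.so name, which these CUDA 12.6.3 conda packages appear to ship only as libnvToolsExt.so.1; the header copy announced above follows in the next commands. A minimal sketch for verifying that the links resolve (the echo text is illustrative):

    test -e /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so \
      && test -e /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so \
      && echo "libnvToolsExt.so symlinks OK"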
2025-05-07T20:26:49.1353198Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:49.3240904Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:49.3264709Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:49.3645524Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:51.2937938Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:51.3657908Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:51.8011627Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:51.8360215Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:52.2822502Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
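[NOTE] Each `conda env config vars set -n build_binary ...` call in this step, including the CUDA_INCLUDE_DIRS one just below, persists the variable inside the environment itself, so it is re-exported on every activation; that is also why the `conda run printenv LD_LIBRARY_PATH` check above failed before the variable had first been set. A minimal sketch for inspecting what has been persisted, assuming conda >= 4.8:

    conda env config vars list -n build_binary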
2025-05-07T20:26:52.2824314Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:54.8238324Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:56.9059904Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:58.9933803Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:58.9934587Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:27:01.0979250Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:27:03.0429780Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:27:03.1199249Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:07.0920930Z /tmp/tmp7ses5f7e: line 3: clang: command not found
2025-05-07T20:27:07.0922564Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:07.1686933Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:07.1705393Z total 36
2025-05-07T20:27:07.1705654Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:27:07.1706018Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:27:07.1706449Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:27:07.1706953Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:27:07.1707452Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:27:07.1707897Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:07.1708631Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:07.1709122Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:27:07.1709641Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:27:07.1710266Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:07.1730438Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:09.1767488Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:09.1768035Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:09.6111599Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:11.5655884Z -allow-unsupported-compiler
2025-05-07T20:27:11.6340861Z [INFO] Printing out all preprocessor defines in nvcc ...
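[NOTE] In the command below, --compiler-options forwards -dM to the host preprocessor, so together with -E and an empty CUDA source read from stdin (-x cu -) nvcc prints every predefined macro instead of preprocessed output. A minimal sketch for narrowing the dump to CUDA-related macros (the grep pattern is illustrative, assuming the same build_binary env):

    conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null | grep -E '__CUDA|cuda[A-Z]'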
2025-05-07T20:27:11.6341396Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:11.6341722Z 2025-05-07T20:27:13.6506595Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:13.6507181Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:13.6507509Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:13.6507817Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:13.6508127Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:13.6508595Z #define _STL_PAIR_H 1 2025-05-07T20:27:13.6509432Z #define __cpp_attributes 200809L 2025-05-07T20:27:13.6509921Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:13.6510427Z #define __DELETE_THROW throw() 2025-05-07T20:27:13.6510770Z #define _PTRDIFF_T_ 2025-05-07T20:27:13.6510996Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:13.6511272Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:13.6511646Z #define _IO_LEFT 02 2025-05-07T20:27:13.6512082Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:13.6512590Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:13.6513103Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:13.6513955Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:13.6515018Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6515482Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:13.6515889Z #define _IOS_OUTPUT 2 2025-05-07T20:27:13.6516395Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:13.6517365Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:13.6517887Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:13.6518303Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:13.6518666Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:13.6519786Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:13.6520660Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:13.6520954Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:13.6521236Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:13.6521532Z #define _T_WCHAR_ 2025-05-07T20:27:13.6521746Z #define stdout stdout 2025-05-07T20:27:13.6522059Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:13.6522425Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:27:13.6522666Z #define __flexarr [] 2025-05-07T20:27:13.6522888Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:13.6523325Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:13.6523664Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:13.6523901Z #define _MATH_H 1 2025-05-07T20:27:13.6524171Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:13.6524600Z #define __S64_TYPE long int 2025-05-07T20:27:13.6524848Z #define __stub_fchflags 2025-05-07T20:27:13.6525104Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:13.6525390Z #define __SQUAD_TYPE long int 2025-05-07T20:27:13.6525646Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:13.6525891Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:13.6526139Z #define NL_NMAX INT_MAX 2025-05-07T20:27:13.6526367Z #define _BITS_TIME_H 1 2025-05-07T20:27:13.6526627Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:13.6526952Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:13.6527251Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:13.6527589Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:13.6527980Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:13.6528334Z #define __CHAR_BIT__ 8 2025-05-07T20:27:13.6528576Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.6528880Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:13.6529166Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:13.6529428Z #define FP_NAN 0 2025-05-07T20:27:13.6529672Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:13.6530099Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:13.6530584Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:13.6530954Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:13.6531236Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:13.6531487Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:13.6531724Z #define __SM_80_RT_H__ 2025-05-07T20:27:13.6531943Z #define _NEW 2025-05-07T20:27:13.6532164Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:13.6532547Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:13.6532906Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.6533304Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:13.6533533Z #define __USE_ANSI 1 2025-05-07T20:27:13.6533852Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:13.6534350Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:13.6534702Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:13.6534986Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:13.6535259Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:13.6535533Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:13.6535798Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:13.6536075Z #define PIPE_BUF 4096 2025-05-07T20:27:13.6536393Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:13.6536737Z #define ADJ_TICK 0x4000 2025-05-07T20:27:13.6537007Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:13.6537326Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:13.6537572Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:13.6537885Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:13.6538340Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.6538955Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:13.6539311Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:13.6539559Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:13.6539828Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:13.6540097Z #define __cpp_static_assert 201411L 2025-05-07T20:27:13.6540437Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:13.6540786Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:13.6541052Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:13.6541325Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:13.6541649Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:13.6541930Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:13.6542214Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6542564Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:13.6542897Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:13.6543165Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:13.6543471Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.6543830Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:13.6544162Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:13.6544448Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:13.6544735Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:13.6545044Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:13.6545357Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:13.6545746Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:13.6546149Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:13.6546441Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:13.6546700Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:13.6546970Z #define __GCC_IEC_559 2 2025-05-07T20:27:13.6547243Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:13.6547568Z #define _IO_flockfile(_fp) 2025-05-07T20:27:13.6547819Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:13.6548078Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:13.6548328Z #define _IOFBF 0 2025-05-07T20:27:13.6548530Z #define __USE_BSD 1 2025-05-07T20:27:13.6548737Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:13.6548999Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:13.6549271Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:13.6549506Z #define _IO_NO_WRITES 8 2025-05-07T20:27:13.6549752Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:13.6550096Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:13.6550436Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:13.6550820Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:13.6551132Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:13.6551424Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:13.6551676Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:13.6551940Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:13.6552247Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:13.6552620Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:13.6552975Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:13.6553282Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:13.6553583Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:13.6553911Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:13.6554217Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:13.6554525Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:13.6554787Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:13.6555060Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:13.6555657Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:13.6556226Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:13.6556551Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:13.6556867Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:13.6557248Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:13.6557542Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:13.6557806Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:13.6558114Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:13.6558433Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:13.6558722Z #define RAND_MAX 2147483647 2025-05-07T20:27:13.6559002Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:13.6570484Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6570832Z #define __SM_90_RT_H__ 2025-05-07T20:27:13.6571074Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:13.6571349Z #define __COMPAR_FN_T 2025-05-07T20:27:13.6571598Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.6571857Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:13.6572334Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:13.6572838Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.6573181Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.6573543Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:13.6573845Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:13.6574182Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:13.6574487Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:13.6574994Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.6575541Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:13.6575865Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:13.6576141Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:13.6576447Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:13.6576740Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:13.6577012Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:13.6577284Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:13.6577542Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:13.6577794Z #define __u_char_defined 2025-05-07T20:27:13.6578124Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:13.6578486Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:13.6578741Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:13.6579003Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:13.6579284Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:13.6579707Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:13.6580123Z #define FP_INFINITE 1 2025-05-07T20:27:13.6580494Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.6580902Z #define _IO_pid_t __pid_t 2025-05-07T20:27:13.6581392Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:13.6581653Z #define __LEAF , __leaf__ 2025-05-07T20:27:13.6581885Z #define PATH_MAX 4096 2025-05-07T20:27:13.6582146Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:13.6582486Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:13.6582805Z #define _LIMITS_H___ 2025-05-07T20:27:13.6583032Z #define __size_t 2025-05-07T20:27:13.6583264Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:13.6583800Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | 
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:13.6584349Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:13.6584667Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:13.6585008Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:13.6585266Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:13.6585631Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.6586032Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:13.6586321Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:13.6586646Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:13.6586935Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:13.6587222Z #define __INT8_C(c) c 2025-05-07T20:27:13.6587475Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:13.6587874Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:13.6588153Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:13.6588407Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:13.6588670Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:13.6588951Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:13.6589264Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.6589586Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:13.6589858Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:13.6590123Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:13.6590375Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:13.6590683Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:13.6590985Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:13.6591334Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:13.6591704Z #define NFDBITS __NFDBITS 2025-05-07T20:27:13.6591959Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:13.6592232Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:13.6592566Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:13.6592883Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:13.6593150Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:13.6593433Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:13.6593737Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:13.6594056Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.6594467Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:13.6594831Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6595122Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:13.6595443Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:13.6595825Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:13.6596173Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:13.6596534Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:13.6596870Z #define __daddr_t_defined 2025-05-07T20:27:13.6597131Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.6597416Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:13.6597726Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:13.6598247Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:13.6598721Z #define _ACRTIMP 2025-05-07T20:27:13.6598931Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:13.6599193Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:13.6599481Z #define _IOS_BIN 128 2025-05-07T20:27:13.6599821Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:13.6600326Z #define __FLT64X_HAS_QUIET_NAN__ 1 
2025-05-07T20:27:13.6600599Z #define UNDERFLOW 4 2025-05-07T20:27:13.6600823Z #define NAME_MAX 255 2025-05-07T20:27:13.6601053Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:13.6601327Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:13.6601614Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:13.6601902Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:13.6602277Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:13.6602658Z #define __ptr_t void * 2025-05-07T20:27:13.6602885Z #define M_E 2.7182818284590452354 2025-05-07T20:27:13.6603160Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:13.6603420Z #define __USE_ISOCXX11 1 2025-05-07T20:27:13.6603675Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:13.6603982Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:13.6604395Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:13.6604678Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:13.6604966Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:13.6605274Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:13.6605528Z #define __linux 1 2025-05-07T20:27:13.6605744Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:13.6606014Z #define cudaDeviceMask 0xff 2025-05-07T20:27:13.6606282Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:13.6606557Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:13.6606921Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:13.6607212Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:13.6607506Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:13.6607808Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:13.6608091Z #define _BITS_TYPES_H 1 2025-05-07T20:27:13.6608701Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:13.6609065Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:13.6609364Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:13.6609631Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:13.6609924Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:13.6610202Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:13.6610978Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:13.6611771Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:13.6612061Z #define __unix 1 2025-05-07T20:27:13.6612272Z #define MATH_ERRNO 1 2025-05-07T20:27:13.6612503Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:13.6612775Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:13.6613042Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:13.6613315Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:13.6613603Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.6613889Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:13.6614350Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:13.6614810Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:13.6615115Z #define CUDARTAPI_CDECL 2025-05-07T20:27:13.6615375Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:13.6615643Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:13.6615928Z #define __cpp_lib_void_t 201411 2025-05-07T20:27:13.6616197Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:13.6616431Z #define __SIZE_T 2025-05-07T20:27:13.6616690Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:13.6617013Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 
0 2025-05-07T20:27:13.6617303Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:13.6617569Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:13.6617834Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:13.6618219Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:13.6618649Z #define __WAIT_STATUS void * 2025-05-07T20:27:13.6618915Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:13.6619196Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:13.6619779Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:13.6620079Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:13.6620361Z #define __WINT_MIN__ 0U 2025-05-07T20:27:13.6620936Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:13.6621581Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:13.6621880Z #define WUNTRACED 2 2025-05-07T20:27:13.6622101Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:13.6622385Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:13.6622670Z #define NZERO 20 2025-05-07T20:27:13.6622908Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:13.6623179Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:13.6623471Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:13.6623759Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:13.6624014Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.6624291Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:13.6624566Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:13.6624838Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:13.6625111Z #define EXIT_FAILURE 1 2025-05-07T20:27:13.6625346Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:13.6625600Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:13.6625862Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:13.6626275Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:13.6626576Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:13.6626946Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:13.6627306Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:13.6627596Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:13.6627857Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:13.6628128Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:13.6628428Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:13.6628728Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:13.6629017Z #define SEEK_DATA 3 2025-05-07T20:27:13.6629252Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:13.6629543Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:13.6629966Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:13.6630355Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:13.6630600Z #define __INT64_C(c) c ## L 2025-05-07T20:27:13.6630873Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:13.6631214Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:13.6631530Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:13.6631807Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:13.6632102Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:13.6632403Z #define STA_PPSWANDER 0x0400 2025-05-07T20:27:13.6632653Z #define __INT_WCHAR_T_H 2025-05-07T20:27:13.6632894Z #define WSTOPPED 2 2025-05-07T20:27:13.6633128Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:13.6633405Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:13.6633654Z #define FP_NORMAL 4 
2025-05-07T20:27:13.6633899Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:13.6634174Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:13.6634414Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:13.6634672Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:13.6634951Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:13.6635225Z #define cudaTextureType1D 0x01 2025-05-07T20:27:13.6635495Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:13.6635761Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:13.6636035Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:13.6636328Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:13.6636748Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:13.6637193Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:13.6637458Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:13.6637726Z #define _POSIX_SOURCE 1 2025-05-07T20:27:13.6637972Z #define cudaTextureType2D 0x02 2025-05-07T20:27:13.6638240Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:13.6638612Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:13.6638927Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:13.6639195Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:13.6639523Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:13.6639858Z #define cudaTextureType3D 0x03 2025-05-07T20:27:13.6640137Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:13.6640405Z #define CLOCK_REALTIME 0 2025-05-07T20:27:13.6640647Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:13.6640924Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:13.6641228Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:13.6641504Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:13.6641784Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:13.6642076Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:13.6642353Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:13.6642650Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:13.6642948Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:13.6643238Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:13.6643486Z #define __GLIBC__ 2 2025-05-07T20:27:13.6643714Z #define __END_DECLS } 2025-05-07T20:27:13.6643961Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:13.6644479Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:13.6644852Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:13.6645196Z #define WCONTINUED 8 2025-05-07T20:27:13.6645429Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:13.6645685Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:13.6645954Z #define _ALLOCA_H 1 2025-05-07T20:27:13.6646185Z #define __host__ __location__(host) 2025-05-07T20:27:13.6646602Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.6647039Z #define __SLONG32_TYPE int 2025-05-07T20:27:13.6647306Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:13.6647580Z #define _SYS_SELECT_H 1 2025-05-07T20:27:13.6647819Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:13.6648072Z #define _IOS_NOCREATE 32 2025-05-07T20:27:13.6648314Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:13.6648591Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:27:13.6648880Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:13.6649155Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:13.6649435Z #define __global__ __location__(global) 2025-05-07T20:27:13.6649727Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:13.6649974Z #define 
__cpp_decltype_auto 201304L 2025-05-07T20:27:13.6650239Z #define __DBL_DIG__ 15 2025-05-07T20:27:13.6650461Z #define TIME_UTC 1 2025-05-07T20:27:13.6650666Z #define __FLT32_DIG__ 6 2025-05-07T20:27:13.6650984Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:13.6651368Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:13.6651679Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:13.6651976Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:13.6652270Z #define _G_BUFSIZ 8192 2025-05-07T20:27:13.6652580Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:13.6652936Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:13.6653228Z #define __cudaCDP2GetDevice 2025-05-07T20:27:13.6653506Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:13.6653778Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:13.6654028Z #define __GXX_WEAK__ 1 2025-05-07T20:27:13.6654283Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.6654569Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:13.6654822Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:13.6655117Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:13.6655451Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:13.6655718Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:13.6656000Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:13.6656295Z #define _G_config_h 1 2025-05-07T20:27:13.6656606Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:13.6656943Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:13.6657354Z #define _GCC_WCHAR_T 2025-05-07T20:27:13.6657575Z #define TMP_MAX 238328 2025-05-07T20:27:13.6657815Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:13.6658079Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:13.6658327Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.6658601Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:13.6658879Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:13.6659154Z #define _IO_SKIPWS 01 2025-05-07T20:27:13.6659550Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:13.6660001Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:13.6660264Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:13.6660585Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:13.6660941Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:13.6661303Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:13.6661654Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.6661909Z #define le32toh(x) (x) 2025-05-07T20:27:13.6662136Z #define _SIZE_T_DEFINED 2025-05-07T20:27:13.6662377Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:13.6662711Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:13.6663053Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:13.6663439Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:13.6663939Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:13.6664203Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:13.6664461Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:13.6664712Z #define _POSIX_NAME_MAX 14 2025-05-07T20:27:13.6665009Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:13.6665555Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:13.6666037Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:13.6666342Z 
#define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:13.6666689Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:13.6666995Z #define _WCHAR_T_ 2025-05-07T20:27:13.6667226Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:13.6667585Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:13.6667962Z #define RTSIG_MAX 32 2025-05-07T20:27:13.6668174Z #define _STDDEF_H 2025-05-07T20:27:13.6668407Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:13.6668673Z #define _VA_LIST_DEFINED 2025-05-07T20:27:13.6668913Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:13.6669245Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:13.6669627Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:13.6669941Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:13.6670227Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:13.6670681Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:13.6671187Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:13.6671551Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:13.6671865Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:13.6672171Z #define __unix__ 1 2025-05-07T20:27:13.6672397Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.6672676Z #define __INT_WIDTH__ 32 2025-05-07T20:27:13.6672920Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:13.6673152Z #define _IONBF 2 2025-05-07T20:27:13.6673590Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:13.6674343Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
[Preprocessor macro dump elided for length. Between 2025-05-07T20:27:13.66Z and 2025-05-07T20:27:13.70Z the job logged several thousand #define lines (a predefined-macro listing from the host compiler / NVCC preprocessor), covering glibc, libstdc++, Parallel STL (_PSTL_*), and CUDA runtime (cuda*, __cudaCDP2*) macros, with many entries split mid-macro across log lines. Toolchain facts recoverable from the dump: GCC 11.4.0 (__GNUC__ 11, __VERSION__ "11.4.0", __GLIBCXX__ 20230528), glibc 2.17 (__GLIBC_MINOR__ 17), C++17 mode (__cplusplus 201703L), CUDA 12.6 (__CUDACC_VER_MAJOR__ 12, __CUDACC_VER_MINOR__ 6, __CUDACC_VER_BUILD__ 85), device compilation pass __CUDA_ARCH__ 520, target x86_64 Linux LP64 (__x86_64__, __linux__, __LP64__).]
__SIZEOF_INT128__ 16 2025-05-07T20:27:13.6993921Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:13.6994015Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:13.6994125Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:13.6994271Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:13.6994378Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.6994487Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.6994583Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:13.6994676Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:13.6994773Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:13.6994904Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:13.6995022Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.6995236Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:13.6995416Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:13.6995505Z #define __stub_stty 2025-05-07T20:27:13.6995671Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:13.6995756Z #define le16toh(x) (x) 2025-05-07T20:27:13.6995870Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:13.6996043Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:13.6996155Z #define _SIZET_ 2025-05-07T20:27:13.6996288Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:13.6996579Z #define _SVID_SOURCE 1 2025-05-07T20:27:13.6996667Z #define _LP64 1 2025-05-07T20:27:13.6996763Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:13.6997000Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:13.6997111Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:13.6997201Z #define __UINT8_C(c) c 2025-05-07T20:27:13.6997424Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:13.6997520Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:13.6997627Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:13.6997719Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:13.6997819Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:13.6997916Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:13.6998002Z #define CUDARTAPI 2025-05-07T20:27:13.6998093Z #define IOV_MAX 1024 2025-05-07T20:27:13.6998236Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:13.6998332Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:13.6998440Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:13.6998522Z #define __wchar_t__ 2025-05-07T20:27:13.6998623Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:13.6998711Z #define SEEK_END 2 2025-05-07T20:27:13.6998802Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:13.6998984Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:13.6999088Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:13.6999231Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:13.6999327Z #define ____FILE_defined 1 2025-05-07T20:27:13.6999444Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:13.6999538Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:13.6999634Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:13.6999808Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:13.7000055Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.7000189Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:13.7000275Z #define _IO_RIGHT 04 2025-05-07T20:27:13.7000374Z #define __END_NAMESPACE_STD 2025-05-07T20:27:13.7000558Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.7000646Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:13.7000769Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:13.7000861Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:13.7000965Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:13.7001052Z #define _STDDEF_H_ 2025-05-07T20:27:13.7001224Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.7001321Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7001446Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:13.7001656Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:13.7001774Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.7001914Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:13.7002033Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:13.7002141Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:13.7002251Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:13.7002351Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.7002471Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:13.7002567Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:13.7002664Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:13.7002770Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:13.7002943Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:13.7003045Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:13.7003222Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:13.7003330Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:13.7003434Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:13.7003581Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:13.7003677Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:13.7003778Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:13.7003878Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:13.7003978Z #define P_tmpdir "/tmp" 2025-05-07T20:27:13.7004115Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:13.7004210Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:13.7004460Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:13.7004744Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.7004913Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:13.7005025Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:13.7005144Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:13.7005255Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:13.7005384Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:13.7005611Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:13.7005707Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:13.7005828Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:13.7005924Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:13.7006013Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:13.7006118Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:13.7006214Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:13.7006317Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:13.7006400Z #define __FXSR__ 1 2025-05-07T20:27:13.7006486Z #define _SIZE_T 2025-05-07T20:27:13.7006598Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:13.7006709Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:13.7006878Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.7007041Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:13.7007218Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:13.7007318Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:13.7007519Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.7007718Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:13.7007820Z #define _GXX_NULLPTR_T 2025-05-07T20:27:13.7007942Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:13.7008030Z #define FOPEN_MAX 16 2025-05-07T20:27:13.7008129Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:13.7008493Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.7008629Z #define __suseconds_t_defined 2025-05-07T20:27:13.7008724Z #define __off_t_defined 2025-05-07T20:27:13.7008816Z #define stderr stderr 2025-05-07T20:27:13.7008909Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:13.7009030Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:13.7009125Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:13.7009217Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:13.7009651Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:13.7009745Z #define __mode_t_defined 2025-05-07T20:27:13.7009842Z #define _GCC_SIZE_T 2025-05-07T20:27:13.7009942Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.7010044Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:13.7010167Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:13.7010262Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:13.7010358Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:13.7010488Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:13.7010600Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:13.7010708Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:13.7010817Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:13.7010901Z #define __size_t__ 2025-05-07T20:27:13.7011044Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.7011147Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:13.7011259Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:13.7011430Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:13.7011528Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:13.7011702Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:13.7011810Z #define _ENDIAN_H 1 2025-05-07T20:27:13.7011919Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:13.7012010Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:13.7012108Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:13.7012202Z #define __try try 2025-05-07T20:27:13.7012530Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:13.7012628Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:13.7012714Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:13.7012969Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:13.7013073Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:13.7013155Z #define __PIC__ 2 2025-05-07T20:27:13.7013273Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:13.7013412Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:13.7013545Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:13.7013645Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:13.7013756Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:13.7013945Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.7014048Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.7014162Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:13.7014256Z #define _IO_uid_t __uid_t 2025-05-07T20:27:13.7014382Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:13.7014513Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:13.7014608Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:13.7014772Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.7014878Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:13.7015167Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:13.7015266Z #define LONG_BIT 64 2025-05-07T20:27:13.7015377Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:13.7015479Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:13.7015620Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:13.7015725Z #define __fsfilcnt_t_defined 2025-05-07T20:27:13.7015830Z #define __blkcnt_t_defined 2025-05-07T20:27:13.7016101Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.7016195Z #define __USE_LARGEFILE 1 2025-05-07T20:27:13.7016314Z #define __cpp_constexpr 201603L 2025-05-07T20:27:13.7016414Z #define CUDART_VERSION 12060 2025-05-07T20:27:13.7016507Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:13.7016628Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:13.7016725Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:13.7016925Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:13.7017038Z #define __lldiv_t_defined 1 2025-05-07T20:27:13.7017133Z #define __SSE2__ 1 2025-05-07T20:27:13.7017216Z #define _IOLBF 1 2025-05-07T20:27:13.7017328Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:13.7017420Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:13.7017542Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:13.7017644Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:13.7017755Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:13.7017868Z #define __INT32_TYPE__ int 2025-05-07T20:27:13.7017959Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:13.7018070Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:13.7018183Z #define __cpp_exceptions 199711L 2025-05-07T20:27:13.7018283Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.7018394Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:13.7018502Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:13.7018616Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:13.7018775Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:13.7018892Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:13.7018991Z #define __SWORD_TYPE long int 2025-05-07T20:27:13.7019099Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:13.7019195Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:13.7019295Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:13.7019403Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:13.7019685Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.7019780Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:13.7019939Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:13.7020021Z #define _T_SIZE 2025-05-07T20:27:13.7020223Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:13.7020359Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:13.7020492Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:13.7020598Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:13.7020689Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:13.7020820Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:13.7020928Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:13.7021032Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7021123Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:13.7021309Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:13.7021405Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:13.7021505Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:13.7021607Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:13.7021724Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.7021823Z #define __PIE__ 2 2025-05-07T20:27:13.7021932Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:13.7022028Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:13.7022224Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:13.7022439Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:13.7022531Z #define __nlink_t_defined 2025-05-07T20:27:13.7022786Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:13.7022899Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:13.7022981Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:13.7023320Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:13.7023602Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:13.7023713Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:13.7023814Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:13.7023907Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:13.7024015Z #define __FILE_defined 1 2025-05-07T20:27:13.7024203Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:13.7024302Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:13.7024409Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:13.7024517Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:13.7024639Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:13.7024980Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:13.7025088Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:13.7025173Z #define __INT16_C(c) c 2025-05-07T20:27:13.7025282Z #define __U32_TYPE unsigned int 2025-05-07T20:27:13.7025379Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:13.7025519Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:13.7025605Z #define __STDC__ 1 2025-05-07T20:27:13.7025707Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:13.7025825Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:13.7025927Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:13.7026086Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:13.7026190Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:13.7026291Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:13.7026389Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:13.7026519Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:13.7026629Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:13.7026743Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:13.7026847Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:13.7026932Z #define stdin stdin 2025-05-07T20:27:13.7027048Z #define __ino64_t_defined 
2025-05-07T20:27:13.7027137Z #define STA_CLK 0x8000 2025-05-07T20:27:13.7027233Z #define __clockid_t_defined 1 2025-05-07T20:27:13.7027391Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:13.7027556Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:13.7027661Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:13.7027784Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:13.7027994Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:13.7028100Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:13.7028320Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:13.7028414Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:13.7028965Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:13.7029054Z #define DOMAIN 1 2025-05-07T20:27:13.7029147Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:13.7029244Z #define __NVCC__ 1 2025-05-07T20:27:13.7029350Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:13.7029465Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.7029579Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:13.7029683Z #define __throw_exception_again throw 2025-05-07T20:27:13.7029799Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:13.7029893Z #define __EXCEPTION_H 1 2025-05-07T20:27:13.7029996Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.7030113Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:13.7030415Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:13.7030612Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:13.7030722Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:13.7030820Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:13.7030922Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:13.7031031Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:13.7031174Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:13.7031287Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.7031394Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:13.7031486Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:13.7031598Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:13.7031699Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.7031798Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:13.7031946Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:13.7032040Z #define __useconds_t_defined 2025-05-07T20:27:13.7032136Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:13.7032334Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.7032480Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:13.7032576Z #define __SSE_MATH__ 1 2025-05-07T20:27:13.7032665Z #define _IO_wint_t wint_t 2025-05-07T20:27:13.7032758Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:13.7032862Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:13.7032955Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:13.7033067Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:13.7033176Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:13.7033264Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:13.7033352Z #define __USE_ATFILE 1 2025-05-07T20:27:13.7033457Z #define _POSIX_OPEN_MAX 
20 2025-05-07T20:27:13.7033550Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:13.7033637Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:13.7033881Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:13.7033977Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:13.7034097Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:13.7034199Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:13.7034306Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:13.7034401Z #define _STDLIB_H 1 2025-05-07T20:27:13.7034544Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:13.7034642Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.7034752Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:13.7034882Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.7034995Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:13.7035104Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:13.7035373Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:13.7035532Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:13.7035651Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:13.7035770Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:13.7035873Z #define __ldiv_t_defined 1 2025-05-07T20:27:13.7036060Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.7036152Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:13.7036331Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.7036432Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:13.7036521Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:13.7036643Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:13.7036744Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.7036864Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:13.7036962Z #define CUDART_CB 2025-05-07T20:27:13.7037093Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:13.7037231Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:13.7037316Z #define MB_LEN_MAX 16 2025-05-07T20:27:13.7037538Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:13.7037653Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:13.7037872Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:13.7037982Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:13.7038093Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:13.7038239Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:13.7038344Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:13.7038446Z #define _GNU_SOURCE 1 2025-05-07T20:27:13.7038532Z #define __stub_putmsg 2025-05-07T20:27:13.7038625Z #define __CUDACC__ 1 2025-05-07T20:27:13.7038716Z #define __N(msgid) (msgid) 2025-05-07T20:27:13.7038802Z #define __P(args) args 2025-05-07T20:27:13.7039070Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:13.7039172Z #define __cpp_init_captures 201304L 2025-05-07T20:27:13.7039277Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:13.7039378Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:13.7039475Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:13.7039555Z #define __WCHAR_T 2025-05-07T20:27:13.7039664Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:13.7039768Z #define __fsblkcnt_t_defined 2025-05-07T20:27:13.7039883Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:13.7039995Z #define 
__DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:27:13.7040002Z 2025-05-07T20:27:13.7397729Z 2025-05-07T20:27:13.7398512Z + conda run -n build_binary nvcc --version 2025-05-07T20:27:13.7398530Z 2025-05-07T20:27:15.6872228Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:27:15.6872725Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:27:15.6873037Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:27:15.6873345Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:27:15.6873687Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:27:15.6873895Z 2025-05-07T20:27:15.7621582Z 2025-05-07T20:27:15.7633255Z /usr/bin/nvidia-smi 2025-05-07T20:27:15.7638658Z + nvidia-smi 2025-05-07T20:27:15.7638913Z 2025-05-07T20:27:15.7820426Z Wed May 7 20:27:15 2025 2025-05-07T20:27:15.7820854Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:15.7821447Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:15.7821936Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:15.7822417Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:15.7822933Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:15.7823362Z | | | MIG M. | 2025-05-07T20:27:15.7823996Z |=========================================+========================+======================| 2025-05-07T20:27:15.7991923Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:15.7992408Z | 0% 27C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:15.7992801Z | | | N/A | 2025-05-07T20:27:15.7993189Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:15.7996689Z 2025-05-07T20:27:15.7997105Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:15.7997873Z | Processes: | 2025-05-07T20:27:15.7998722Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:15.7999125Z | ID ID Usage | 2025-05-07T20:27:15.7999464Z |=========================================================================================| 2025-05-07T20:27:15.8001948Z | No running processes found | 2025-05-07T20:27:15.8002443Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:16.0530676Z 2025-05-07T20:27:16.0536344Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:16.0592583Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:16.0593129Z . 
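The version pairing above is worth a note: nvcc reports the toolkit installed into the conda env (release 12.6, V12.6.85), while the "CUDA Version: 12.8" in the nvidia-smi banner is the maximum runtime the 570.133.07 driver supports. The driver only needs to be at least as new as the toolkit, so 12.8 >= 12.6 is fine. A minimal sketch of the same sanity check, assuming only the env name build_binary from this job (the rest is standard nvcc/nvidia-smi usage):

  # Sketch: confirm the toolkit and driver versions agree with the job configuration.
  conda run -n build_binary nvcc --version | grep release     # toolkit: expect "release 12.6"
  nvidia-smi --query-gpu=driver_version --format=csv,noheader # driver: expect 570.133.07
  nvidia-smi | head -n 4                                      # banner includes "CUDA Version: 12.8"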
2025-05-07T20:27:16.0592583Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:16.0593129Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:16.0607413Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.0607750Z env:
2025-05-07T20:27:16.0607986Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.0608549Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.0608796Z   BUILD_TARGET: genai
2025-05-07T20:27:16.0609020Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.0609244Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:27:16.0609492Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.0609794Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.0610126Z ##[endgroup]
2025-05-07T20:27:16.4060769Z ################################################################################
2025-05-07T20:27:16.4061187Z # Install PyTorch (PIP)
2025-05-07T20:27:16.4061411Z #
2025-05-07T20:27:16.4076862Z # [2025-05-07T20:27:16.407Z] + install_pytorch_pip build_binary nightly cuda/12.6.3
2025-05-07T20:27:16.4077330Z ################################################################################
2025-05-07T20:27:16.4108099Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:17.4283467Z Channels:
2025-05-07T20:27:17.4283734Z  - conda-forge
2025-05-07T20:27:17.4283970Z Platform: linux-64
2025-05-07T20:27:21.2102501Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:21.9347942Z Solving environment: done
2025-05-07T20:27:22.1546057Z ## Package Plan ##
2025-05-07T20:27:22.1546463Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:22.1546869Z   added / updated specs:
2025-05-07T20:27:22.1547107Z     - numpy
2025-05-07T20:27:22.1547370Z The following packages will be downloaded:
2025-05-07T20:27:22.1547689Z     package                    |            build
2025-05-07T20:27:22.1548006Z     ---------------------------|-----------------
2025-05-07T20:27:22.1548378Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1549108Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1549540Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:22.1549979Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:22.1550420Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:22.1550891Z     libopenblas-0.3.29         |pthreads_h94d23a6_0           5.6 MB  conda-forge
2025-05-07T20:27:22.1551379Z     numpy-2.2.5                |   py313h17eae1a_0            8.1 MB  conda-forge
2025-05-07T20:27:22.1551766Z     ------------------------------------------------------------
2025-05-07T20:27:22.1552105Z                                            Total:        15.4 MB
2025-05-07T20:27:22.1552434Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:22.1552874Z   libblas            conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:22.1553370Z   libcblas           conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:22.1553883Z   libgfortran        conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:22.1554451Z   libgfortran5       conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:22.1569119Z   liblapack          conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:22.1569718Z   libopenblas        conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:22.1570485Z   numpy              conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:22.1570921Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:22.3204842Z libblas-3.9.0        | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4379986Z libcblas-3.9.0       | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4700613Z liblapack-3.9.0      | 16 KB   | ########## | 100%
2025-05-07T20:27:22.4977051Z libgfortran-15.1.0   | 34 KB   | ########## | 100%
2025-05-07T20:27:22.5937055Z libgfortran5-15.1.0  | 1.5 MB  | ########## | 100%
2025-05-07T20:27:22.6591406Z libopenblas-0.3.29   | 5.6 MB  | ########## | 100%
2025-05-07T20:27:23.0639372Z numpy-2.2.5          | 8.1 MB  | ########## | 100%
2025-05-07T20:27:23.0642324Z done
2025-05-07T20:27:23.1645509Z Preparing transaction: done
2025-05-07T20:27:23.2650006Z Verifying transaction: done
2025-05-07T20:27:23.3658385Z Executing transaction: done
2025-05-07T20:27:23.5627445Z ################################################################################
2025-05-07T20:27:23.5627814Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.5628110Z #
2025-05-07T20:27:23.5645001Z # [2025-05-07T20:27:23.564Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:23.5645489Z ################################################################################
2025-05-07T20:27:23.5661126Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:23.6559140Z [CHECK] Network does not appear to be blocked.
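Every network-bound command in this log carries an [EXEC] [ATTEMPT 0/3] prefix, which comes from a retry wrapper in the prelude (.github/scripts/setup_env.bash). The wrapper's source is not shown in this log, so the following is only a sketch of the observable behavior; the function name and the backoff are illustrative:

  # Sketch only: approximates the [EXEC] [ATTEMPT n/3] retry behavior seen above.
  exec_with_retries () {
    local max=3
    for attempt in $(seq 0 "${max}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
      "$@" && return 0
      sleep $((2 ** attempt))   # illustrative backoff; the real delay is not shown in the log
    done
    echo "[EXEC] Command failed after ${max} retries: $*" >&2
    return 1
  }

  # Usage, matching the network probe above:
  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null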
2025-05-07T20:27:23.6559573Z ################################################################################ 2025-05-07T20:27:23.6559918Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:23.6560214Z # 2025-05-07T20:27:23.6579319Z # [2025-05-07T20:27:23.657Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:23.6579769Z ################################################################################ 2025-05-07T20:27:23.6580003Z 2025-05-07T20:27:23.6604010Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:23.6631169Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:23.6648802Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:23.6649396Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:23.6658216Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:23.6667767Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:23.6689986Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:44.2145702Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:44.2147581Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:44.2147975Z Collecting torch 2025-05-07T20:28:44.2148633Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:44.2149336Z Collecting filelock (from torch) 2025-05-07T20:28:44.2149562Z 2025-05-07T20:28:44.2149893Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:44.2150816Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:44.2151882Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:44.2152900Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:44.2153391Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:44.2154264Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 32.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2154624Z Collecting networkx (from torch) 2025-05-07T20:28:44.2155122Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:44.2155758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 27.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2156099Z Collecting jinja2 (from torch) 2025-05-07T20:28:44.2156586Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:44.2157084Z Collecting fsspec (from torch) 2025-05-07T20:28:44.2157564Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 
2025-05-07T20:28:44.2158132Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2158856Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:44.2159626Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 54.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2160033Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2160748Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:44.2161525Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.3 MB/s eta 0:00:00 2025-05-07T20:28:44.2162125Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:44.2162821Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:44.2163584Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 49.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2163946Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:44.2164724Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:44.2165480Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 34.1 MB/s eta 0:00:00 2025-05-07T20:28:44.2165848Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:44.2166606Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:44.2167448Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 81.6 MB/s eta 0:00:00 2025-05-07T20:28:44.2167828Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:44.2168484Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:44.2169233Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 128.3 MB/s eta 0:00:00 2025-05-07T20:28:44.2169618Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:44.2170281Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:44.2171027Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 146.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2171409Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:44.2172092Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:44.2172868Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 167.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2173242Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:44.2173927Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:44.2174693Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 144.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2175168Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:44.2175859Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:44.2176805Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 141.0 MB/s eta 0:00:00 2025-05-07T20:28:44.2177170Z 
Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:44.2177924Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:44.2178680Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:44.2179325Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:28:44.2179986Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:44.2180753Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:44.2181596Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 154.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2181970Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:44.2182733Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:44.2183532Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:44.2184460Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:44.2185275Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:44.2185809Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:44.2186439Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 58.1 MB/s eta 0:00:00 2025-05-07T20:28:44.2186798Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:44.2187279Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:44.2187776Z Preparing metadata (setup.py): started 2025-05-07T20:28:44.2188171Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:44.2188902Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:44.2189689Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 37.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2190445Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:44.2191269Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 11.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2192104Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:44.2192917Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 98.5 MB/s eta 0:00:00 2025-05-07T20:28:44.2193690Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:44.2194544Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 132.8 MB/s eta 0:00:00 2025-05-07T20:28:44.2194935Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:44.2195296Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:44.2195722Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:44.2196567Z Created wheel for MarkupSafe: 
filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=0ad2daeb7144f6b1498751df4fa6a76a2c004ca82d33a0f5885e5a381123a56d
2025-05-07T20:28:44.2197598Z   Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6
2025-05-07T20:28:44.2198159Z Successfully built MarkupSafe
2025-05-07T20:28:44.2199895Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
2025-05-07T20:28:44.2203391Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126
2025-05-07T20:28:46.5220391Z torch                2.8.0.dev20250507+cu126
2025-05-07T20:28:46.5222809Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126)
2025-05-07T20:28:50.0412563Z [CHECK] Python (sub-)package 'torch.distributed' found ...
2025-05-07T20:28:53.5481518Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126
2025-05-07T20:28:53.5482331Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ...
2025-05-07T20:28:56.9670436Z True
2025-05-07T20:28:56.9670668Z True
2025-05-07T20:28:57.0355253Z [INSTALL] Successfully installed PyTorch through PyTorch PIP
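The variant and ABI checks above matter because the FBGEMM build that follows links C++ extensions against this torch wheel: the wheel must be the cu126 variant to match the installed 12.6 toolkit, and the True lines report the _GLIBCXX_USE_CXX11_ABI probe, so the extension and the wheel agree on the libstdc++ ABI. The same probes can be re-run by hand with public torch APIs; a sketch, assuming only the build_binary env name from this job:

  # Sketch: re-run the variant and ABI probes against the installed wheel.
  conda run -n build_binary python -c "import torch; print(torch.__version__)"              # expect a +cu126 suffix
  conda run -n build_binary python -c "import torch; print(torch.version.cuda)"             # expect 12.6
  conda run -n build_binary python -c "import torch; print(torch.compiled_with_cxx11_abi())"  # expect True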
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:57.0412052Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:57.0412390Z env: 2025-05-07T20:28:57.0412603Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:57.0412887Z BUILD_ENV: build_binary 2025-05-07T20:28:57.0413121Z BUILD_TARGET: genai 2025-05-07T20:28:57.0413343Z BUILD_VARIANT: cuda 2025-05-07T20:28:57.0413566Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:57.0413803Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:57.0414094Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:57.0414412Z ##[endgroup] 2025-05-07T20:28:57.3804207Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:57.3806051Z ################################################################################ 2025-05-07T20:28:57.3807077Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:57.3807857Z # 2025-05-07T20:28:57.3823344Z # [2025-05-07T20:28:57.381Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:57.3824159Z ################################################################################ 2025-05-07T20:28:57.3824623Z 2025-05-07T20:28:57.3838969Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:57.4780726Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:57.4791180Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:57.4792210Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:57.4792671Z 2025-05-07T20:28:57.5681745Z 2025-05-07T20:28:57.5682410Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:57.5707195Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:03.5982574Z Collecting environment information... 
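The environment dump below can be reproduced outside of CI with the same upstream helper the job fetches above; a minimal sketch, assuming a conda env named build_binary with torch already installed:

    # Fetch the helper script from pytorch/pytorch, exactly as the job does
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py
    # Equivalently, run the copy that ships inside the installed torch package
    conda run -n build_binary python -m torch.utils.collect_env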
2025-05-07T20:29:03.5983177Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:29:03.5983610Z Is debug build: False 2025-05-07T20:29:03.5983956Z CUDA used to build PyTorch: 12.6 2025-05-07T20:29:03.5984332Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:03.5984592Z 2025-05-07T20:29:03.5984760Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:03.5985211Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:03.5985837Z Clang version: Could not collect 2025-05-07T20:29:03.5986890Z CMake version: Could not collect 2025-05-07T20:29:03.5987735Z Libc version: glibc-2.34 2025-05-07T20:29:03.5988266Z 2025-05-07T20:29:03.5988970Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:29:03.5989966Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:03.5990721Z Is CUDA available: True 2025-05-07T20:29:03.5991291Z CUDA runtime version: 12.6.85 2025-05-07T20:29:03.5991849Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:03.5992388Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:03.6006569Z Nvidia driver version: 570.133.07 2025-05-07T20:29:03.6006993Z cuDNN version: Could not collect 2025-05-07T20:29:03.6007357Z HIP runtime version: N/A 2025-05-07T20:29:03.6007719Z MIOpen runtime version: N/A 2025-05-07T20:29:03.6008088Z Is XNNPACK available: True 2025-05-07T20:29:03.6008549Z 2025-05-07T20:29:03.6008677Z CPU: 2025-05-07T20:29:03.6008963Z Architecture: x86_64 2025-05-07T20:29:03.6009432Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:03.6009977Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:03.6010505Z Byte Order: Little Endian 2025-05-07T20:29:03.6010945Z CPU(s): 16 2025-05-07T20:29:03.6011375Z On-line CPU(s) list: 0-15 2025-05-07T20:29:03.6012409Z Vendor ID: AuthenticAMD 2025-05-07T20:29:03.6012919Z Model name: AMD EPYC 7R32 2025-05-07T20:29:03.6013406Z CPU family: 23 2025-05-07T20:29:03.6013826Z Model: 49 2025-05-07T20:29:03.6014224Z Thread(s) per core: 2 2025-05-07T20:29:03.6014643Z Core(s) per socket: 8 2025-05-07T20:29:03.6015044Z Socket(s): 1 2025-05-07T20:29:03.6015447Z Stepping: 0 2025-05-07T20:29:03.6015935Z BogoMIPS: 5600.00 2025-05-07T20:29:03.6018993Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:03.6022319Z Hypervisor vendor: KVM 2025-05-07T20:29:03.6022773Z Virtualization type: full 2025-05-07T20:29:03.6023250Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:03.6023778Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:03.6024304Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:03.6024797Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:03.6025243Z NUMA node(s): 1 2025-05-07T20:29:03.6025660Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:03.6026146Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:03.6026665Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:03.6027177Z Vulnerability L1tf: Not affected 2025-05-07T20:29:03.6027677Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:03.6028182Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:03.6028704Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:03.6029265Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:03.6030092Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:03.6030956Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:03.6031752Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:03.6032728Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:03.6033999Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:03.6034995Z Vulnerability Srbds: Not affected 2025-05-07T20:29:03.6035501Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:03.6035835Z 2025-05-07T20:29:03.6035980Z Versions of relevant libraries: 2025-05-07T20:29:03.6036335Z [pip3] numpy==2.2.5 2025-05-07T20:29:03.6036659Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:29:03.6037086Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:29:03.6037498Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:29:03.6037931Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:29:03.6038361Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:29:03.6038741Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:29:03.6039153Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:29:03.6039558Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:29:03.6039989Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:29:03.6040583Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:03.6041003Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:03.6041389Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:29:03.6041779Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:29:03.6042186Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:03.6042603Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:29:03.6043097Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6043779Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6044675Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6045424Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6046170Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6046932Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:03.6047622Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6048420Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:29:03.6049110Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:03.6049804Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:03.6050487Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6051141Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:03.6051788Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6052432Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6053097Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6053797Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:29:03.6054449Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:29:03.6055117Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:29:03.6055764Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6056440Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:29:03.6057109Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6057761Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:29:03.6058415Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:03.6059092Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:03.6059785Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6060469Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:29:03.6061199Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:03.6061911Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:03.6062605Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:29:03.6063260Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:29:03.6063952Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:29:03.6064688Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6065397Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6066098Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:29:03.6066908Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:29:03.6067600Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:29:03.6068293Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:29:03.6069029Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:29:03.6069743Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:03.6070417Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:03.6071100Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:29:03.6071751Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:03.6072425Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:03.6073071Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:29:03.6073471Z 2025-05-07T20:29:03.6864077Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:03.6865175Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:03.6879589Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:03.6879921Z env: 2025-05-07T20:29:03.6880127Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:03.6880420Z BUILD_ENV: build_binary 2025-05-07T20:29:03.6880656Z BUILD_TARGET: genai 2025-05-07T20:29:03.6880862Z BUILD_VARIANT: cuda 2025-05-07T20:29:03.6881092Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:03.6881344Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:03.6881623Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:03.6881945Z ##[endgroup] 2025-05-07T20:29:04.0298573Z ################################################################################ 2025-05-07T20:29:04.0299187Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:04.0299461Z # 2025-05-07T20:29:04.0316381Z # [2025-05-07T20:29:04.031Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:04.0316875Z ################################################################################ 2025-05-07T20:29:04.0317091Z 2025-05-07T20:29:04.0333593Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:04.1286224Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:04.1307971Z [BUILD] Running git submodules update ... 2025-05-07T20:29:04.1330681Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:04.1695135Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:04.1695768Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:04.1696201Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:04.1696581Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:04.1696984Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:04.1697418Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:04.1697830Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:04.1731937Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:04.2286257Z [BUILD] Installing other build dependencies ... 
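The prepare step above amounts to three commands; a minimal sketch, assuming a pytorch/FBGEMM checkout (cwd fbgemm_gpu) and an existing build_binary conda env, with the job's own dependency-install output following below:

    # Sync and fetch all submodules, as prepare_fbgemm_gpu_build does
    git submodule sync
    git submodule update --init --recursive
    # Install the build dependencies pinned in fbgemm_gpu/requirements.txt
    conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt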
2025-05-07T20:29:04.2307274Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:06.6696616Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:06.6860621Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:06.7945724Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:06.7967547Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:07.0078097Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:07.0101829Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:07.1257560Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:07.1281038Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:07.4268044Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:07.4293565Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:07.4879063Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:07.4882259Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:07.5623142Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:07.5648865Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:07.6120160Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:07.6712292Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:07.6740067Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:07.8026448Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:07.8047699Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:07.9218119Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:07.9250645Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:07.9817074Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:08.0339959Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:08.0359962Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:08.1379803Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:08.1399740Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:08.2633484Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:08.2663968Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:08.3874055Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:08.3894442Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:08.4957490Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:08.4979446Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:08.5998804Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:08.6026534Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:08.7110520Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:08.7129298Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:08.7688154Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:08.8149066Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:08.8167330Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:08.8686705Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:08.9269723Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:08.9289950Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:08.9785468Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:09.0414628Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:09.0438885Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:09.0922354Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:09.1524504Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:09.2054008Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:09.7095989Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 55.2 MB/s eta 0:00:00 2025-05-07T20:29:09.7119577Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:09.7820210Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:09.8406813Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:09.8989855Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:09.9594620Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:10.0106164Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:29:10.0729657Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 8.3 MB/s eta 0:00:00 2025-05-07T20:29:10.0772521Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:10.1341407Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:10.1979473Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:10.2555147Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:10.3187346Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:10.3754055Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:10.4273193Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:10.4794509Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:10.5391791Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:10.5881794Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:10.7642949Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:13.1359465Z 2025-05-07T20:29:13.1410276Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:13.3327091Z ################################################################################ 2025-05-07T20:29:13.3327459Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:13.3327723Z # 2025-05-07T20:29:13.3344090Z # [2025-05-07T20:29:13.334Z] + install_triton_pip build_binary 2025-05-07T20:29:13.3344480Z ################################################################################ 2025-05-07T20:29:13.3344716Z 2025-05-07T20:29:13.3344939Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:13.3345367Z ################################################################################ 2025-05-07T20:29:13.3345723Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:13.3346035Z # 2025-05-07T20:29:13.3364024Z # [2025-05-07T20:29:13.336Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:13.3364651Z ################################################################################ 2025-05-07T20:29:13.3364883Z 2025-05-07T20:29:13.3381811Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:13.4368317Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:13.4368661Z ################################################################################ 2025-05-07T20:29:13.4369001Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:13.4369266Z # 2025-05-07T20:29:13.4387929Z # [2025-05-07T20:29:13.438Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:13.4388431Z ################################################################################ 2025-05-07T20:29:13.4388658Z 2025-05-07T20:29:13.4439812Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:13.4457250Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:13.4457921Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:13.4466271Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:13.4475825Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:13.4497196Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:21.0921118Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:21.0922478Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:21.0923367Z 2025-05-07T20:29:21.0923591Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:21.0924041Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:21.0925071Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:21.0926439Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:21.0927659Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 56.6 MB/s eta 0:00:00 2025-05-07T20:29:21.0928069Z Installing collected packages: pytorch-triton 2025-05-07T20:29:21.0928436Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:21.0928847Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:21.0929302Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:21.0929765Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:21.0930251Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:21.0930549Z 2025-05-07T20:29:23.3733655Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:23.3737243Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:25.5885955Z ################################################################################ 2025-05-07T20:29:25.5886437Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:25.5886812Z ################################################################################ 2025-05-07T20:29:25.5887032Z 2025-05-07T20:29:27.6950024Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:29.9460489Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:29.9465040Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:29.9522849Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:29.9523606Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:29.9540182Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:29.9540688Z env: 2025-05-07T20:29:29.9541010Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:29.9541452Z BUILD_ENV: build_binary 2025-05-07T20:29:29.9541796Z BUILD_TARGET: genai 2025-05-07T20:29:29.9542126Z BUILD_VARIANT: cuda 2025-05-07T20:29:29.9542461Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:29.9542838Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:29.9543300Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:29.9543837Z ##[endgroup] 2025-05-07T20:29:30.3012287Z ################################################################################ 2025-05-07T20:29:30.3012647Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:30.3012898Z # 2025-05-07T20:29:30.3030033Z # [2025-05-07T20:29:30.302Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3031064Z ################################################################################ 2025-05-07T20:29:30.3031279Z 2025-05-07T20:29:30.3031633Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3032312Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3032646Z 2025-05-07T20:29:30.3149188Z b4ae9b0abd70864ad0f9bc87eab637debe5f8911 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3151833Z 2025-05-07T20:29:30.3152413Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3152759Z 2025-05-07T20:29:30.3284994Z 288e01505cd42cb622816f5ed4cb9190deac249c91490a8fe2dfe37b78609048 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3286283Z 2025-05-07T20:29:30.3287041Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3287586Z 2025-05-07T20:29:30.3518759Z ce6591c5de70d034e768ce9f8fdfb894 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:30.3521272Z 2025-05-07T20:29:30.3530795Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:30.3551495Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:33.1041182Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:33.1042405Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:33.1043273Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:33.1043719Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:33.1044004Z 2025-05-07T20:29:40.2626266Z ################################################################################ 2025-05-07T20:29:40.2626635Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:40.2627023Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:40.2627456Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:40.2627758Z [CHECK] 2025-05-07T20:29:40.2628082Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:40.2628582Z [CHECK] package channel, the package may be broken at runtime!!! 2025-05-07T20:29:40.2628973Z ################################################################################ 2025-05-07T20:29:40.2629181Z 2025-05-07T20:29:40.2629294Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:44.3677816Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:48.4764414Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:52.6177441Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:52.6181065Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:30:04.9594199Z ################################################################################ 2025-05-07T20:30:04.9596325Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:30:04.9596918Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:30:04.9597279Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:30:04.9597607Z ################################################################################ 2025-05-07T20:30:04.9597824Z 2025-05-07T20:30:13.2192696Z ################################################################################ 2025-05-07T20:30:13.2193126Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:30:13.2194497Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:30:13.2196647Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:30:13.2197151Z ################################################################################ 2025-05-07T20:30:13.2197369Z 2025-05-07T20:30:13.2197517Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:17.3314931Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:21.4692091Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:25.7294064Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:29.8624926Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:29.8629155Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:33.8956941Z fbgemm.nccl_init 2025-05-07T20:30:33.8959119Z 2025-05-07T20:30:33.9652185Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:37.9974685Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:37.9974888Z 2025-05-07T20:30:38.0643907Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:42.1043546Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:42.1043786Z 2025-05-07T20:30:42.1732415Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:42.1733015Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:42.1767001Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:42.1767472Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:42.1782657Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:42.1783017Z env: 2025-05-07T20:30:42.1783235Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:42.1783523Z BUILD_ENV: build_binary 2025-05-07T20:30:42.1783763Z BUILD_TARGET: genai 2025-05-07T20:30:42.1783978Z BUILD_VARIANT: cuda 2025-05-07T20:30:42.1784194Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:42.1784440Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:42.1784730Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:42.1785038Z ##[endgroup] 2025-05-07T20:30:42.5214600Z ################################################################################ 2025-05-07T20:30:42.5214959Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:42.5215223Z # 2025-05-07T20:30:42.5232593Z # [2025-05-07T20:30:42.522Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:42.5233056Z ################################################################################ 2025-05-07T20:30:42.5233287Z 2025-05-07T20:30:50.7618556Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:50.7619410Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:50.7619797Z [TEST] Determined the test directories: 2025-05-07T20:30:50.7620104Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:50.7620387Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:50.7620680Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:50.7620863Z 2025-05-07T20:30:50.7626722Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:50.7633456Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:50.7634180Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:50.7634655Z 2025-05-07T20:30:51.1989316Z 2025-05-07T20:30:51.1989674Z [TEST] Installing PyTest ... 
2025-05-07T20:30:51.2017621Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:52.3173958Z Channels: 2025-05-07T20:30:52.3174217Z - conda-forge 2025-05-07T20:30:52.3174839Z Platform: linux-64 2025-05-07T20:30:55.9799552Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:57.1617633Z Solving environment: \ | / done 2025-05-07T20:30:57.3909199Z 2025-05-07T20:30:57.3909799Z ## Package Plan ## 2025-05-07T20:30:57.3912106Z 2025-05-07T20:30:57.3912352Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:57.3912663Z 2025-05-07T20:30:57.3912767Z added / updated specs: 2025-05-07T20:30:57.3913173Z - expecttest 2025-05-07T20:30:57.3913570Z - pytest 2025-05-07T20:30:57.3913779Z 2025-05-07T20:30:57.3913786Z 2025-05-07T20:30:57.3913992Z The following packages will be downloaded: 2025-05-07T20:30:57.3914409Z 2025-05-07T20:30:57.3914601Z package | build 2025-05-07T20:30:57.3915073Z ---------------------------|----------------- 2025-05-07T20:30:57.3915445Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:57.3915918Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:57.3916370Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:57.3916804Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:57.3917231Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:57.3917643Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:57.3918047Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:57.3918826Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:57.3919217Z ------------------------------------------------------------ 2025-05-07T20:30:57.3919549Z Total: 428 KB 2025-05-07T20:30:57.3919762Z 2025-05-07T20:30:57.3919883Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:57.3920104Z 2025-05-07T20:30:57.3920304Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:57.3920801Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:57.3921674Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:57.3922546Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:57.3923050Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:57.3923476Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:57.3923895Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:57.3924472Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:57.3924720Z 2025-05-07T20:30:57.3924725Z 2025-05-07T20:30:57.3924729Z 2025-05-07T20:30:57.3924874Z Downloading and Extracting Packages: ...working... 
[conda download progress-bar output elided; all eight packages (colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, tomli-2.2.1) reached 100%] 2025-05-07T20:30:57.9534095Z done 2025-05-07T20:30:58.0536898Z Preparing transaction: done 2025-05-07T20:30:58.1539061Z Verifying transaction: done 2025-05-07T20:31:00.1570893Z Executing transaction: done 2025-05-07T20:31:00.3135068Z [TEST] Checking imports ... 2025-05-07T20:31:04.4193685Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:31:04.4206045Z [TEST] Setting feature flags ...
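The flag in the next command is stored with conda's per-environment variable mechanism, so it persists across later conda run and activate calls; the general pattern, as a sketch:

    # Persist a variable inside the env; it takes effect on the next activation/run
    conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
    # The inverse, used earlier so that CUDA devices are not masked during testing
    conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES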
2025-05-07T20:31:04.4206473Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:04.4206803Z 2025-05-07T20:31:04.8586862Z 2025-05-07T20:31:04.8587302Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:04.8588911Z ################################################################################ 2025-05-07T20:31:04.8589335Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:04.8589606Z # 2025-05-07T20:31:04.8609758Z # [2025-05-07T20:31:04.860Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:04.8610308Z ################################################################################ 2025-05-07T20:31:04.8610527Z 2025-05-07T20:31:04.8617831Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:04.8646826Z ./attention/gqa_test.py 2025-05-07T20:31:04.8647244Z ./coalesce/coalesce_test.py 2025-05-07T20:31:04.8647548Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:04.8647821Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:04.8648105Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:04.8648350Z ./moe/activation_test.py 2025-05-07T20:31:04.8648592Z ./moe/gather_scatter_test.py 2025-05-07T20:31:04.8648832Z ./moe/layers_test.py 2025-05-07T20:31:04.8649057Z ./moe/shuffling_test.py 2025-05-07T20:31:04.8649295Z ./quantize/quantize_test.py 2025-05-07T20:31:04.8649457Z 2025-05-07T20:31:04.8649575Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:04.8649777Z 2025-05-07T20:31:04.8667879Z ################################################################################ 2025-05-07T20:31:04.8683353Z # [2025-05-07T20:31:04.868Z] Run Python Test Suite: 2025-05-07T20:31:04.8683796Z # ./attention/gqa_test.py 2025-05-07T20:31:04.8684078Z ################################################################################ 2025-05-07T20:31:04.8707652Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:04.8708506Z 2025-05-07T20:31:07.4615410Z ============================= test session starts ============================== 2025-05-07T20:31:07.4616300Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:07.4616829Z cachedir: .pytest_cache 2025-05-07T20:31:07.4617711Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:07.4618436Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:07.4618847Z plugins: hypothesis-6.131.14 2025-05-07T20:31:09.0771443Z collecting ... 
collected 2 items 2025-05-07T20:31:09.0771749Z 2025-05-07T20:31:46.7105972Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:46.7106751Z self=, 2025-05-07T20:31:46.7107261Z int4_kv=False, 2025-05-07T20:31:46.7107604Z num_groups=1, 2025-05-07T20:31:46.7107920Z B=1, 2025-05-07T20:31:46.7108477Z MAX_T=4, 2025-05-07T20:31:46.7108801Z N_H_L=1, 2025-05-07T20:31:46.7109111Z ) 2025-05-07T20:31:46.7109420Z Trying example: test_gqa( 2025-05-07T20:31:46.7109876Z self=, 2025-05-07T20:31:46.7110360Z int4_kv=True, 2025-05-07T20:31:46.7110681Z num_groups=1, 2025-05-07T20:31:46.7111015Z B=1, 2025-05-07T20:31:46.7111299Z MAX_T=4, 2025-05-07T20:31:46.7111634Z N_H_L=1, 2025-05-07T20:31:46.7111933Z ) 2025-05-07T20:31:46.7112251Z Trying example: test_gqa( 2025-05-07T20:31:46.7112715Z self=, 2025-05-07T20:31:46.7113218Z int4_kv=True, 2025-05-07T20:31:46.7113538Z num_groups=4, 2025-05-07T20:31:46.7113815Z B=23, 2025-05-07T20:31:46.7114042Z MAX_T=33, 2025-05-07T20:31:46.7114308Z N_H_L=68, 2025-05-07T20:31:46.7114560Z ) 2025-05-07T20:31:46.7114798Z Trying example: test_gqa( 2025-05-07T20:31:46.7115164Z self=, 2025-05-07T20:31:46.7115578Z int4_kv=True, 2025-05-07T20:31:46.7115833Z num_groups=4, 2025-05-07T20:31:46.7116100Z B=77, 2025-05-07T20:31:46.7116346Z MAX_T=4, 2025-05-07T20:31:46.7116580Z N_H_L=1, 2025-05-07T20:31:46.7116825Z ) 2025-05-07T20:31:46.7117071Z Trying example: test_gqa( 2025-05-07T20:31:46.7117421Z self=, 2025-05-07T20:31:46.7117823Z int4_kv=True, 2025-05-07T20:31:46.7118106Z num_groups=4, 2025-05-07T20:31:46.7118350Z B=77, 2025-05-07T20:31:46.7118601Z MAX_T=52, 2025-05-07T20:31:46.7118861Z N_H_L=67, 2025-05-07T20:31:46.7119096Z ) 2025-05-07T20:31:46.7119344Z Trying example: test_gqa( 2025-05-07T20:31:46.7120260Z self=, 2025-05-07T20:31:46.7120646Z int4_kv=False, 2025-05-07T20:31:46.7120928Z num_groups=4, 2025-05-07T20:31:46.7121204Z B=57, 2025-05-07T20:31:46.7121451Z MAX_T=45, 2025-05-07T20:31:46.7121711Z N_H_L=120, 2025-05-07T20:31:46.7121963Z ) 2025-05-07T20:31:46.7122203Z Trying example: test_gqa( 2025-05-07T20:31:46.7122577Z self=, 2025-05-07T20:31:46.7122988Z int4_kv=True, 2025-05-07T20:31:46.7123258Z num_groups=4, 2025-05-07T20:31:46.7123506Z B=52, 2025-05-07T20:31:46.7123754Z MAX_T=42, 2025-05-07T20:31:46.7124018Z N_H_L=53, 2025-05-07T20:31:46.7124250Z ) 2025-05-07T20:31:46.7124659Z Trying example: test_gqa( 2025-05-07T20:31:46.7125036Z self=, 2025-05-07T20:31:46.7125416Z int4_kv=True, 2025-05-07T20:31:46.7125686Z num_groups=1, 2025-05-07T20:31:46.7125955Z B=77, 2025-05-07T20:31:46.7126180Z MAX_T=95, 2025-05-07T20:31:46.7126450Z N_H_L=53, 2025-05-07T20:31:46.7126691Z ) 2025-05-07T20:31:46.7126924Z Trying example: test_gqa( 2025-05-07T20:31:46.7127383Z self=, 2025-05-07T20:31:46.7127757Z int4_kv=True, 2025-05-07T20:31:46.7128023Z num_groups=4, 2025-05-07T20:31:46.7128287Z B=113, 2025-05-07T20:31:46.7128514Z MAX_T=48, 2025-05-07T20:31:46.7128771Z N_H_L=96, 2025-05-07T20:31:46.7129029Z ) 2025-05-07T20:31:46.7129261Z Trying example: test_gqa( 2025-05-07T20:31:46.7129632Z self=, 2025-05-07T20:31:46.7130038Z int4_kv=False, 2025-05-07T20:31:46.7130299Z num_groups=1, 2025-05-07T20:31:46.7130557Z B=51, 2025-05-07T20:31:46.7131064Z MAX_T=61, 2025-05-07T20:31:46.7131296Z N_H_L=69, 2025-05-07T20:31:46.7131549Z ) 2025-05-07T20:31:46.7131793Z Trying example: test_gqa( 2025-05-07T20:31:46.7132133Z self=, 2025-05-07T20:31:46.7132522Z int4_kv=False, 2025-05-07T20:31:46.7132801Z num_groups=4, 2025-05-07T20:31:46.7133040Z B=17, 2025-05-07T20:31:46.7133275Z MAX_T=113, 
[Hypothesis verbose output elided: a long run of shrinking "Trying example: test_gqa(...)" entries, drawn over int4_kv in {False, True}, num_groups in {1, 4}, and assorted B, MAX_T, N_H_L sizes]
2025-05-07T20:31:46.7188201Z PASSED
2025-05-07T20:31:46.7294811Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
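[Editor's note] The "Trying example" stream above is Hypothesis running in verbose mode under the 'ci' profile (derandomize=True, deadline=None, as printed in the session headers below). A minimal sketch of the pattern that produces such output; the exact strategies in gqa_test.py are an assumption here, only the parameter names are taken from the log:

    import unittest
    from hypothesis import Verbosity, given, settings, strategies as st

    class Int4GQATestSketch(unittest.TestCase):
        @given(
            int4_kv=st.booleans(),
            num_groups=st.sampled_from([1, 4]),
            B=st.integers(min_value=1, max_value=128),
            MAX_T=st.integers(min_value=1, max_value=128),
            N_H_L=st.integers(min_value=1, max_value=128),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=40, deadline=None)
        def test_gqa(self, int4_kv, num_groups, B, MAX_T, N_H_L) -> None:
            # The real test exercises grouped-query attention against a KV
            # cache; this sketch only shows why every drawn example is logged:
            # with Verbosity.verbose, Hypothesis prints each example before
            # running it, including the shrunken ones after a pass or failure.
            self.assertGreater(B * MAX_T * N_H_L, 0)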
2025-05-07T20:31:46.7295313Z =========================== short test summary info ============================
2025-05-07T20:31:46.7296200Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:31:46.7297424Z ======================== 1 passed, 1 skipped in 39.78s =========================
2025-05-07T20:31:47.4908087Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:31:47.4929264Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
2025-05-07T20:31:47.4950684Z ################################################################################
2025-05-07T20:31:47.4966275Z # [2025-05-07T20:31:47.496Z] Run Python Test Suite:
2025-05-07T20:31:47.4966606Z #   ./coalesce/coalesce_test.py
2025-05-07T20:31:47.4966892Z ################################################################################
2025-05-07T20:31:47.4991677Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:31:49.7093088Z ============================= test session starts ==============================
2025-05-07T20:31:49.7093859Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:49.7094369Z cachedir: .pytest_cache
2025-05-07T20:31:49.7094951Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:49.7095688Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:49.7096082Z plugins: hypothesis-6.131.14
2025-05-07T20:31:51.2806746Z collecting ... collected 1 item
2025-05-07T20:31:52.0583750Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:31:52.0584243Z ============================== 1 passed in 2.48s ===============================
2025-05-07T20:31:52.7824762Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:31:52.7843016Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:31:52.7864299Z ################################################################################
2025-05-07T20:31:52.7879656Z # [2025-05-07T20:31:52.787Z] Run Python Test Suite:
2025-05-07T20:31:52.7879983Z #   ./comm/multi_gpu_car_test.py
2025-05-07T20:31:52.7880276Z ################################################################################
2025-05-07T20:31:52.7907287Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:55.0004586Z ============================= test session starts ==============================
2025-05-07T20:31:55.0005807Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:55.0006860Z cachedir: .pytest_cache
2025-05-07T20:31:55.0007989Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:55.0008978Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:55.0009394Z plugins: hypothesis-6.131.14
2025-05-07T20:31:56.6517236Z collecting ... collected 5 items
2025-05-07T20:31:56.6528569Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:56.6536432Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:56.6543703Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:56.6555846Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:56.6572026Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:56.6572963Z =========================== short test summary info ============================
2025-05-07T20:31:56.6573639Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6574586Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6575519Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6576451Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6577367Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:56.6578034Z ============================== 5 skipped in 1.79s ==============================
2025-05-07T20:31:57.3280895Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:57.3302745Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
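[Editor's note] All five CAR collective tests skip because this g5.4xlarge runner exposes a single A10G. A sketch of the guard the skip messages imply; the helper name is hypothetical, not the exact decorator in multi_gpu_car_test.py:

    import unittest
    import torch

    def skip_if_single_gpu(min_gpus: int = 2):
        # Mirrors the skip reason printed in the summary above.
        return unittest.skipIf(
            not torch.cuda.is_available() or torch.cuda.device_count() < min_gpus,
            "Skip when CUDA is not available or when there are not enough "
            f"GPUs; these tests require at least {min_gpus} GPUs",
        )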
2025-05-07T20:31:57.3325395Z ################################################################################
2025-05-07T20:31:57.3340896Z # [2025-05-07T20:31:57.333Z] Run Python Test Suite:
2025-05-07T20:31:57.3341241Z #   ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:57.3341565Z ################################################################################
2025-05-07T20:31:57.3368017Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:59.5539765Z ============================= test session starts ==============================
2025-05-07T20:31:59.5540455Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:59.5541437Z cachedir: .pytest_cache
2025-05-07T20:31:59.5542031Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:59.5542742Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:59.5543149Z plugins: hypothesis-6.131.14
2025-05-07T20:32:01.2428689Z collecting ... collected 2 items
2025-05-07T20:32:01.2438944Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:32:01.2454024Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:32:01.2454636Z =========================== short test summary info ============================
2025-05-07T20:32:01.2455254Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:01.2456095Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:01.2456695Z ============================== 2 skipped in 1.83s ==============================
2025-05-07T20:32:01.9312742Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:01.9331313Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
2025-05-07T20:32:01.9353401Z ################################################################################
2025-05-07T20:32:01.9371242Z # [2025-05-07T20:32:01.936Z] Run Python Test Suite:
2025-05-07T20:32:01.9371577Z #   ./kv_cache/kv_cache_test.py
2025-05-07T20:32:01.9371851Z ################################################################################
2025-05-07T20:32:01.9397414Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:32:04.1532862Z ============================= test session starts ==============================
2025-05-07T20:32:04.1533931Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:04.1534833Z cachedir: .pytest_cache
2025-05-07T20:32:04.1535762Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:04.1536923Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:04.1537646Z plugins: hypothesis-6.131.14
2025-05-07T20:32:05.7927044Z collecting ... collected 4 items
2025-05-07T20:32:08.3887949Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:32:08.3969570Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:32:08.4059843Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:32:08.4148012Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:32:08.4148515Z =========================== short test summary info ============================
2025-05-07T20:32:08.4149210Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:32:08.4150127Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:32:08.4150733Z ============================== 4 skipped in 4.40s ==============================
2025-05-07T20:32:10.7959121Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:32:10.7982828Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds
2025-05-07T20:32:10.8004924Z ################################################################################
2025-05-07T20:32:10.8020962Z # [2025-05-07T20:32:10.801Z] Run Python Test Suite:
2025-05-07T20:32:10.8021301Z #   ./moe/activation_test.py
2025-05-07T20:32:10.8021582Z ################################################################################
2025-05-07T20:32:10.8048310Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:32:13.0137332Z ============================= test session starts ==============================
2025-05-07T20:32:13.0137958Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:13.0138500Z cachedir: .pytest_cache
2025-05-07T20:32:13.0139079Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:13.0139802Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:13.0140199Z plugins: hypothesis-6.131.14
2025-05-07T20:32:14.6739803Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:14.7709211Z collecting ... collected 2 items
2025-05-07T20:32:20.1425386Z moe/activation_test.py::ActivationTests::test_silu_mul [Hypothesis verbose output elided: "Trying example: test_silu_mul(...)" entries over T in {1, 128, 2048, 4096, 16384}, D in {5120, 7168}, contiguous in {True, False}, compiled in {True, False}]
2025-05-07T20:32:20.1507589Z PASSED
2025-05-07T20:32:20.2138067Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:20.2139140Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:32:20.2140474Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:20.2141893Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:20.2142870Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
2025-05-07T20:32:20.2144167Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:20.2145762Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.2147105Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:20.2148510Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.2149581Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                        module_map=module_map)
2025-05-07T20:32:20.2150874Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:20.2152155Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:32:20.2153014Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:20.2154255Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:20.2155477Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:32:20.2156501Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
2025-05-07T20:32:20.2157510Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:32:20.2158709Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:20.2160111Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:20.2161008Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:20.2162083Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
2025-05-07T20:32:20.2163118Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:32:20.2163878Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ~~~~~~~~~~^^^^^^
2025-05-07T20:32:20.2165198Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:20.2166552Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:20.2167600Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.2168502Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:20.2169326Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:32:20.2170331Z W0507 20:32:20.211000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[Three further copies of the identical identify_mutated_tensors warning and traceback (W0507 20:32:20.228000, 20:32:20.267000, and 20:32:20.273000, all ending in the same fp8e4nv ValueError) elided as duplicates.]
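[Editor's note] The warning itself is benign for correctness: when torch.compile cannot lower a user-defined Triton kernel to TTIR to work out which inputs it mutates, it conservatively assumes every input is mutated. The underlying ValueError is the real signal here. Triton's fp8e4nv type (FP8 E4M3, i.e. torch.float8_e4m3fn) is only code-generated on compute capability 8.9 or newer (Ada, Hopper), while the A10G in this g5.4xlarge runner is SM 8.6 (Ampere), where only fp8e4b15 and fp8e5 are available. A sketch of a capability guard one could use to skip such kernels on older parts; the helper name is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs SM >= 8.9; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)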
2025-05-07T20:32:20.6955510Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.6956865Z     self=,
2025-05-07T20:32:20.6957673Z     T=1,
2025-05-07T20:32:20.6958042Z     D=5120,
2025-05-07T20:32:20.6958413Z     scale_ub=None,
2025-05-07T20:32:20.6958852Z     contiguous=True,
2025-05-07T20:32:20.6959283Z     compiled=True,
2025-05-07T20:32:20.6959677Z )
2025-05-07T20:32:20.6960296Z self = 
2025-05-07T20:32:20.6961266Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:20.6962363Z     @given(
2025-05-07T20:32:20.6962809Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:20.6963428Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:20.6964027Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:20.6964811Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:20.6965438Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:20.6965993Z     )
2025-05-07T20:32:20.6966684Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:20.6967541Z     def test_silu_mul_quant(
2025-05-07T20:32:20.6968030Z         self,
2025-05-07T20:32:20.6968417Z         T: int,
2025-05-07T20:32:20.6968785Z         D: int,
2025-05-07T20:32:20.6969211Z         scale_ub: Optional[float],
2025-05-07T20:32:20.6969744Z         contiguous: bool,
2025-05-07T20:32:20.6970201Z         compiled: bool,
2025-05-07T20:32:20.6970660Z     ) -> None:
2025-05-07T20:32:20.6970920Z         torch.manual_seed(2025)
2025-05-07T20:32:20.6971447Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:20.6972015Z         x_sign = torch.sign(x)
2025-05-07T20:32:20.6972309Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:20.6972639Z         x = x_sign * x_clamp
2025-05-07T20:32:20.6972892Z         x0 = x[:, :D]
2025-05-07T20:32:20.6973113Z         x1 = x[:, D:]
2025-05-07T20:32:20.6973537Z         if contiguous:
2025-05-07T20:32:20.6973773Z             x0 = x0.contiguous()
2025-05-07T20:32:20.6974240Z             x1 = x1.contiguous()
2025-05-07T20:32:20.6974699Z         if scale_ub is not None:
2025-05-07T20:32:20.6974993Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:20.6975349Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:20.6975671Z             )
2025-05-07T20:32:20.6975877Z         else:
2025-05-07T20:32:20.6976098Z             scale_ub_tensor = None
2025-05-07T20:32:20.6976604Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:20.6976938Z             op = silu_mul_quant
2025-05-07T20:32:20.6977213Z             if compiled:
2025-05-07T20:32:20.6977466Z                 op = torch.compile(op)
2025-05-07T20:32:20.6977776Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:20.6978259Z         y_fp8, y_scale = fn()
2025-05-07T20:32:20.6978552Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:20.6979095Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:20.6979445Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:20.6979747Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:20.6980064Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:20.6980437Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:20.6980961Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:20.6981272Z moe/activation_test.py:126:
2025-05-07T20:32:20.6981598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.6981948Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:20.6982274Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:20.6983086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:20.6983849Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:20.6984404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:20.6985183Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:20.6985884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:20.6986625Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:20.6987360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:20.6987991Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:20.6988596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:20.6989117Z     fn()
2025-05-07T20:32:20.6989620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:20.6990207Z     self.fn.run(
2025-05-07T20:32:20.6990676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:20.6991213Z     kernel = self.compile(
2025-05-07T20:32:20.6991746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:20.6992406Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.6992811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.6993257Z self = 
2025-05-07T20:32:20.6994413Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:20.6995807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab0ae09ee0>}
2025-05-07T20:32:20.6997211Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:20.6998230Z context = 
2025-05-07T20:32:20.6998682Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:20.6999197Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.6999667Z                            module_map=module_map)
2025-05-07T20:32:20.7000027Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.7000372Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:20.7000634Z E       ^
2025-05-07T20:32:20.7001093Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7001955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
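[Editor's note] For reference, a plain-PyTorch sketch of what the failing reference path appears to compute. The semantics of triton_quantize_fp8_row are an assumption here (per-row max-abs scaling, optionally clamped by scale_ub); only the dequant convention y_fp8.to(float32) * y_scale[:, None] is taken from the test above:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its largest magnitude maps to the fp8 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Clamp the per-row scale by the upper bound, as the scale_ub
            # argument in the test suggests (assumption).
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / fp8_max                  # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale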
2025-05-07T20:32:20.7002568Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:20.7002981Z     self=,
2025-05-07T20:32:20.7003379Z     T=2048,
2025-05-07T20:32:20.7003558Z     D=5120,
2025-05-07T20:32:20.7003744Z     scale_ub=1200.0,
2025-05-07T20:32:20.7003968Z     contiguous=True,
2025-05-07T20:32:20.7004179Z     compiled=False,
2025-05-07T20:32:20.7004462Z )
2025-05-07T20:32:20.7004787Z self = 
2025-05-07T20:32:20.7005273Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[Quoted test source identical to the listing above elided; this example fails one step earlier, at the call into silu_mul_quant itself:]
2025-05-07T20:32:20.7017631Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:20.7017894Z moe/activation_test.py:117:
2025-05-07T20:32:20.7018206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.7018563Z moe/activation_test.py:115: in fn
2025-05-07T20:32:20.7018839Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:20.7019543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:20.7020241Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:20.7020802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:20.7021487Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:20.7022161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:20.7022699Z     kernel = self.compile(
2025-05-07T20:32:20.7023239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:20.7023901Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:20.7024317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:20.7024757Z self = 
2025-05-07T20:32:20.7025833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:20.7027338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab098420c0>}
2025-05-07T20:32:20.7028680Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:20.7029713Z context = 
2025-05-07T20:32:20.7030181Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:20.7030689Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:20.7031195Z                            module_map=module_map)
2025-05-07T20:32:20.7031610Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:20.7031962Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:20.7032219Z E       ^
2025-05-07T20:32:20.7032685Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:20.7033554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.7033134Z 2025-05-07T20:32:20.7033554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9665266Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9666344Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:20.9667685Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9669119Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9670097Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9671408Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9672782Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9674090Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9675465Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9676532Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:20.9677796Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9679179Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:20.9680031Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9681241Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9682460Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:20.9683496Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9684650Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:20.9685898Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.9687171Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.9688057Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9689227Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.9690270Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:20.9691043Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.9692206Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.9702949Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.9704035Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.9704945Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.9705694Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:20.9706720Z W0507 20:32:20.962000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
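Every CompilationError in this section has the same root cause: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn, which Triton can only lower on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this g5 runner is SM 8.6, where Triton exposes only the fp8e4b15 and fp8e5 encodings, so every kernel touching the e4m3 dtype fails during make_ir. A minimal sketch of that capability gate (illustrative, not taken from the test suite):

import torch

# Triton's fp8e4nv (= torch.float8_e4m3fn) compiles only on SM >= 8.9;
# an A10G (g5 instance) reports (8, 6), hence the ValueError above.
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 9):
    print(f"SM {major}.{minor}: fp8e4nv (float8_e4m3fn) is compilable")
else:
    print(f"SM {major}.{minor}: Triton offers only fp8e4b15 / fp8e5 here")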
2025-05-07T20:32:21.5742143Z 2025-05-07T20:32:21.5742351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5742779Z self=, 2025-05-07T20:32:21.5743242Z T=2048, 2025-05-07T20:32:21.5743432Z D=5120, 2025-05-07T20:32:21.5743619Z scale_ub=1200.0, 2025-05-07T20:32:21.5743838Z contiguous=True, 2025-05-07T20:32:21.5744059Z compiled=True, 2025-05-07T20:32:21.5744258Z ) 2025-05-07T20:32:21.5744570Z self = 2025-05-07T20:32:21.5745066Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:21.5745345Z 2025-05-07T20:32:21.5745420Z @given( 2025-05-07T20:32:21.5745648Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5745955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5746260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5746590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5747062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5747348Z ) 2025-05-07T20:32:21.5747687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5748118Z def test_silu_mul_quant( 2025-05-07T20:32:21.5748360Z self, 2025-05-07T20:32:21.5748547Z T: int, 2025-05-07T20:32:21.5748732Z D: int, 2025-05-07T20:32:21.5748945Z scale_ub: Optional[float], 2025-05-07T20:32:21.5749212Z contiguous: bool, 2025-05-07T20:32:21.5749444Z compiled: bool, 2025-05-07T20:32:21.5749664Z ) -> None: 2025-05-07T20:32:21.5749874Z torch.manual_seed(2025) 2025-05-07T20:32:21.5750110Z 2025-05-07T20:32:21.5750374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5750707Z 2025-05-07T20:32:21.5750898Z x_sign = torch.sign(x) 2025-05-07T20:32:21.5751178Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.5751490Z x = x_sign * x_clamp 2025-05-07T20:32:21.5751724Z x0 = x[:, :D] 2025-05-07T20:32:21.5751928Z x1 = x[:, D:] 2025-05-07T20:32:21.5752130Z 2025-05-07T20:32:21.5752315Z if contiguous: 2025-05-07T20:32:21.5752537Z x0 = x0.contiguous() 2025-05-07T20:32:21.5752793Z x1 = x1.contiguous() 2025-05-07T20:32:21.5753027Z 2025-05-07T20:32:21.5753205Z if scale_ub is not None: 2025-05-07T20:32:21.5753469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.5753800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.5754094Z ) 2025-05-07T20:32:21.5754278Z else: 2025-05-07T20:32:21.5754484Z scale_ub_tensor = None 2025-05-07T20:32:21.5754729Z 2025-05-07T20:32:21.5754945Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5755242Z op = silu_mul_quant 2025-05-07T20:32:21.5755481Z if compiled: 2025-05-07T20:32:21.5755718Z op = torch.compile(op) 2025-05-07T20:32:21.5756006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5756266Z 2025-05-07T20:32:21.5756440Z y_fp8, y_scale = fn() 2025-05-07T20:32:21.5756713Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:21.5757122Z 2025-05-07T20:32:21.5757343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5757669Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:21.5757947Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:21.5758237Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:21.5758578Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.5758878Z 2025-05-07T20:32:21.5759073Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:21.5759260Z 2025-05-07T20:32:21.5759354Z moe/activation_test.py:126: 2025-05-07T20:32:21.5759655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5759980Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.5760296Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.5761081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.5761826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.5762360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.5763027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.5763704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.5764526Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.5765331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.5765954Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.5766542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.5767055Z fn() 2025-05-07T20:32:21.5767545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.5768112Z self.fn.run( 2025-05-07T20:32:21.5768564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.5769083Z kernel = self.compile( 2025-05-07T20:32:21.5769610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.5770249Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.5770647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5770870Z 2025-05-07T20:32:21.5771070Z self = 2025-05-07T20:32:21.5772138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.5773501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09e3e840>} 2025-05-07T20:32:21.5774829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.5775836Z context = 2025-05-07T20:32:21.5776124Z 2025-05-07T20:32:21.5776285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.5776796Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.5777249Z module_map=module_map) 2025-05-07T20:32:21.5777684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.5778030Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.5778286Z E ^ 2025-05-07T20:32:21.5778737Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.5779180Z 2025-05-07T20:32:21.5779585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.5780092Z 2025-05-07T20:32:21.5780189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.5780594Z self=, 2025-05-07T20:32:21.5780981Z T=16384, 2025-05-07T20:32:21.5781159Z D=7168, 2025-05-07T20:32:21.5781338Z scale_ub=1200.0, 2025-05-07T20:32:21.5781551Z contiguous=False, 2025-05-07T20:32:21.5781759Z compiled=False, 2025-05-07T20:32:21.5781952Z ) 2025-05-07T20:32:21.5782262Z self = 2025-05-07T20:32:21.5782743Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.5783021Z 2025-05-07T20:32:21.5783092Z @given( 2025-05-07T20:32:21.5783311Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.5783609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.5783909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.5784227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.5784543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.5784806Z ) 2025-05-07T20:32:21.5785226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.5785658Z def test_silu_mul_quant( 2025-05-07T20:32:21.5785885Z self, 2025-05-07T20:32:21.5786067Z T: int, 2025-05-07T20:32:21.5786253Z D: int, 2025-05-07T20:32:21.5786458Z scale_ub: Optional[float], 2025-05-07T20:32:21.5786717Z contiguous: bool, 2025-05-07T20:32:21.5786946Z compiled: bool, 2025-05-07T20:32:21.5787152Z ) -> None: 2025-05-07T20:32:21.5787355Z torch.manual_seed(2025) 2025-05-07T20:32:21.5787584Z 2025-05-07T20:32:21.5787842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.5788172Z 2025-05-07T20:32:21.5788348Z x_sign = torch.sign(x) 2025-05-07T20:32:21.5788629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.5788929Z x = x_sign * x_clamp 2025-05-07T20:32:21.5789152Z x0 = x[:, :D] 2025-05-07T20:32:21.5789360Z x1 = x[:, D:] 2025-05-07T20:32:21.5789564Z 2025-05-07T20:32:21.5789736Z if contiguous: 2025-05-07T20:32:21.5789956Z x0 = x0.contiguous() 2025-05-07T20:32:21.5790204Z x1 = x1.contiguous() 2025-05-07T20:32:21.5790426Z 2025-05-07T20:32:21.5790609Z if scale_ub is not None: 2025-05-07T20:32:21.5790878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.5791200Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.5791499Z ) 2025-05-07T20:32:21.5791680Z else: 2025-05-07T20:32:21.5791876Z scale_ub_tensor = None 2025-05-07T20:32:21.5792118Z 2025-05-07T20:32:21.5792340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.5792644Z op = silu_mul_quant 2025-05-07T20:32:21.5792875Z if compiled: 2025-05-07T20:32:21.5793110Z op = torch.compile(op) 2025-05-07T20:32:21.5793392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5793653Z 2025-05-07T20:32:21.5793834Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.5793991Z 2025-05-07T20:32:21.5794089Z moe/activation_test.py:117: 2025-05-07T20:32:21.5794369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5794801Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.5795069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.5795736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:21.5796411Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.5796933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.5797604Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.5798248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.5798770Z kernel = self.compile( 2025-05-07T20:32:21.5799299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.5799941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.5800328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.5800556Z 2025-05-07T20:32:21.5800756Z self = 2025-05-07T20:32:21.5801817Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.5803171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab097171a0>} 2025-05-07T20:32:21.5804713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.5805721Z context = 2025-05-07T20:32:21.5806012Z 2025-05-07T20:32:21.5806172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.5806682Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.5807131Z module_map=module_map) 2025-05-07T20:32:21.5807483Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.5807823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.5808070Z E ^ 2025-05-07T20:32:21.5808687Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.5809140Z 2025-05-07T20:32:21.5809549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.7616679Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:21.7617745Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:21.7619062Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:21.7620463Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:21.7621434Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:21.7622724Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:21.7624264Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.7625550Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:21.7626905Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.7627942Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:32:21.7629196Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:21.7630429Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:21.7631264Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:21.7632569Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:21.7633769Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:21.7634794Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:21.7635794Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:21.7636985Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:21.7638243Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:21.7639126Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:21.7640198Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:21.7641216Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:21.7641963Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:21.7643115Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:21.7644543Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:21.7645674Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.7646571Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.7647292Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:21.7648295Z W0507 20:32:21.758000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6798871Z 2025-05-07T20:32:22.6799560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6800099Z self=, 2025-05-07T20:32:22.6800535Z T=1, 2025-05-07T20:32:22.6800724Z D=7168, 2025-05-07T20:32:22.6800995Z scale_ub=None, 2025-05-07T20:32:22.6801278Z contiguous=True, 2025-05-07T20:32:22.6801574Z compiled=True, 2025-05-07T20:32:22.6802012Z ) 2025-05-07T20:32:22.6802810Z self = 2025-05-07T20:32:22.6803943Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:22.6804709Z 2025-05-07T20:32:22.6804854Z @given( 2025-05-07T20:32:22.6805301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6805870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6806414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6806993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6807608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6808190Z ) 2025-05-07T20:32:22.6808792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6809316Z def test_silu_mul_quant( 2025-05-07T20:32:22.6809575Z self, 2025-05-07T20:32:22.6809785Z T: int, 2025-05-07T20:32:22.6809997Z D: int, 2025-05-07T20:32:22.6810225Z scale_ub: Optional[float], 2025-05-07T20:32:22.6810526Z contiguous: bool, 2025-05-07T20:32:22.6810794Z compiled: bool, 2025-05-07T20:32:22.6811042Z ) -> None: 2025-05-07T20:32:22.6811274Z torch.manual_seed(2025) 2025-05-07T20:32:22.6811554Z 2025-05-07T20:32:22.6811843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6812224Z 2025-05-07T20:32:22.6812437Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6812753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6813484Z x = x_sign * x_clamp 2025-05-07T20:32:22.6813749Z x0 = x[:, :D] 2025-05-07T20:32:22.6813985Z x1 = x[:, D:] 2025-05-07T20:32:22.6814205Z 2025-05-07T20:32:22.6814410Z if contiguous: 2025-05-07T20:32:22.6814665Z x0 = x0.contiguous() 2025-05-07T20:32:22.6814942Z x1 = x1.contiguous() 2025-05-07T20:32:22.6815210Z 2025-05-07T20:32:22.6815425Z if scale_ub is not None: 2025-05-07T20:32:22.6815718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6816088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6816435Z ) 2025-05-07T20:32:22.6816622Z else: 2025-05-07T20:32:22.6816837Z scale_ub_tensor = None 2025-05-07T20:32:22.6817088Z 2025-05-07T20:32:22.6817313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6817629Z op = silu_mul_quant 2025-05-07T20:32:22.6817885Z if compiled: 2025-05-07T20:32:22.6818143Z op = torch.compile(op) 2025-05-07T20:32:22.6818435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6818711Z 2025-05-07T20:32:22.6818906Z y_fp8, y_scale = fn() 2025-05-07T20:32:22.6819185Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:22.6819475Z 2025-05-07T20:32:22.6819706Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6820031Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:22.6820330Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:22.6820650Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:22.6821199Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6821531Z 2025-05-07T20:32:22.6821750Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:22.6821943Z 2025-05-07T20:32:22.6822060Z moe/activation_test.py:126: 2025-05-07T20:32:22.6822364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
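The traceback that follows pins down the root cause shared by every failing example in this run: whether the kernel is _fbgemm_silu_mul_quant (reached via silu_mul_quant in fn) or _kernel_quantize_fp8_row (reached via triton_quantize_fp8_row in ref_fn), both request the fp8e4nv (FP8 E4M3) dtype, and Triton's make_ir rejects it because the A10G on this linux.g5.4xlarge runner is compute capability 8.6, where only fp8e4b15 and fp8e5 are available; fp8e4nv is generally only lowered on SM 8.9 (Ada) and newer parts. Below is a minimal sketch of the kind of capability guard that would skip these examples on such hardware; supports_fp8e4nv is a hypothetical helper and not part of moe/activation_test.py:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA
        # GPUs with compute capability >= 8.9; the A10G in this log is SM 8.6,
        # which is why make_ir raises the CompilationError seen here.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class SiluMulQuantOnFp8Hardware(unittest.TestCase):
        ...  # fp8-dependent tests such as test_silu_mul_quant would live here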
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6822719Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:22.6823053Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:22.6823849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:22.6824605Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:22.6825163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6825855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6826548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:22.6827280Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:22.6828024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:22.6828665Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:22.6829290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:22.6829815Z fn() 2025-05-07T20:32:22.6830332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:22.6830915Z self.fn.run( 2025-05-07T20:32:22.6831391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6831923Z kernel = self.compile( 2025-05-07T20:32:22.6832465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6833120Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6833529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6833850Z 2025-05-07T20:32:22.6834066Z self = 2025-05-07T20:32:22.6835143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6836535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faaf822d1c0>} 2025-05-07T20:32:22.6837879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6838899Z context = 2025-05-07T20:32:22.6839193Z 2025-05-07T20:32:22.6839367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6839882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6840351Z module_map=module_map) 2025-05-07T20:32:22.6840715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6841060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:22.6841329Z E ^ 2025-05-07T20:32:22.6841793Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6842239Z 2025-05-07T20:32:22.6842773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.6843282Z 2025-05-07T20:32:22.6843385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.6843790Z self=, 2025-05-07T20:32:22.6844195Z T=4096, 2025-05-07T20:32:22.6844449Z D=5120, 2025-05-07T20:32:22.6844640Z scale_ub=None, 2025-05-07T20:32:22.6844855Z contiguous=False, 2025-05-07T20:32:22.6845072Z compiled=False, 2025-05-07T20:32:22.6845274Z ) 2025-05-07T20:32:22.6845596Z self = 2025-05-07T20:32:22.6846095Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.6846365Z 2025-05-07T20:32:22.6846443Z @given( 2025-05-07T20:32:22.6846674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.6846986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.6847290Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.6847620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.6847954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.6848228Z ) 2025-05-07T20:32:22.6848586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.6849030Z def test_silu_mul_quant( 2025-05-07T20:32:22.6849282Z self, 2025-05-07T20:32:22.6849464Z T: int, 2025-05-07T20:32:22.6849663Z D: int, 2025-05-07T20:32:22.6849881Z scale_ub: Optional[float], 2025-05-07T20:32:22.6850142Z contiguous: bool, 2025-05-07T20:32:22.6850385Z compiled: bool, 2025-05-07T20:32:22.6850609Z ) -> None: 2025-05-07T20:32:22.6850816Z torch.manual_seed(2025) 2025-05-07T20:32:22.6851061Z 2025-05-07T20:32:22.6851330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.6851662Z 2025-05-07T20:32:22.6851856Z x_sign = torch.sign(x) 2025-05-07T20:32:22.6852146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.6852443Z x = x_sign * x_clamp 2025-05-07T20:32:22.6852679Z x0 = x[:, :D] 2025-05-07T20:32:22.6852890Z x1 = x[:, D:] 2025-05-07T20:32:22.6853176Z 2025-05-07T20:32:22.6853353Z if contiguous: 2025-05-07T20:32:22.6853576Z x0 = x0.contiguous() 2025-05-07T20:32:22.6853822Z x1 = x1.contiguous() 2025-05-07T20:32:22.6854054Z 2025-05-07T20:32:22.6854233Z if scale_ub is not None: 2025-05-07T20:32:22.6854496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.6854816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.6855120Z ) 2025-05-07T20:32:22.6855301Z else: 2025-05-07T20:32:22.6855495Z scale_ub_tensor = None 2025-05-07T20:32:22.6855738Z 2025-05-07T20:32:22.6855969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.6856268Z op = silu_mul_quant 2025-05-07T20:32:22.6856516Z if compiled: 2025-05-07T20:32:22.6856763Z op = torch.compile(op) 2025-05-07T20:32:22.6857077Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6857378Z 2025-05-07T20:32:22.6857594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.6857762Z 2025-05-07T20:32:22.6857859Z moe/activation_test.py:117: 2025-05-07T20:32:22.6858165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6858507Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.6858794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.6859482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.6860178Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.6860808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.6861489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.6862165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.6862713Z kernel = self.compile( 2025-05-07T20:32:22.6863260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.6863911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.6864317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.6864547Z 2025-05-07T20:32:22.6864764Z self = 2025-05-07T20:32:22.6865884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.6867250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab096993a0>} 2025-05-07T20:32:22.6868605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.6869638Z context = 2025-05-07T20:32:22.6869928Z 2025-05-07T20:32:22.6870106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.6870613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.6871077Z module_map=module_map) 2025-05-07T20:32:22.6871450Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.6871798Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.6872043Z E ^ 2025-05-07T20:32:22.6872502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6873034Z 2025-05-07T20:32:22.6873453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.9587046Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:22.9588113Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:22.9589439Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:22.9590848Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:22.9591834Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:22.9593163Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:22.9594567Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.9596078Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:22.9597479Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.9598524Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:22.9599769Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:22.9600998Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:22.9601829Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.9603011Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:22.9604203Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:22.9605380Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:22.9606383Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:22.9607588Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:22.9609119Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:22.9610144Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.9611208Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:22.9612238Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:22.9613002Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:22.9614158Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:22.9615500Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:22.9616541Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.9617441Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.9618166Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:22.9619282Z W0507 20:32:22.955000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5923574Z 2025-05-07T20:32:24.5924141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5924734Z self=, 2025-05-07T20:32:24.5925168Z T=4096, 2025-05-07T20:32:24.5925360Z D=7168, 2025-05-07T20:32:24.5925536Z scale_ub=None, 2025-05-07T20:32:24.5925755Z contiguous=False, 2025-05-07T20:32:24.5925982Z compiled=False, 2025-05-07T20:32:24.5926206Z ) 2025-05-07T20:32:24.5926527Z self = 2025-05-07T20:32:24.5927027Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.5927295Z 2025-05-07T20:32:24.5927817Z @given( 2025-05-07T20:32:24.5928035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5928350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5928660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5929020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5929344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5929616Z ) 2025-05-07T20:32:24.5929960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5939457Z def test_silu_mul_quant( 2025-05-07T20:32:24.5939750Z self, 2025-05-07T20:32:24.5939946Z T: int, 2025-05-07T20:32:24.5940160Z D: int, 2025-05-07T20:32:24.5940389Z scale_ub: Optional[float], 2025-05-07T20:32:24.5940666Z contiguous: bool, 2025-05-07T20:32:24.5940922Z compiled: bool, 2025-05-07T20:32:24.5941161Z ) -> None: 2025-05-07T20:32:24.5941373Z torch.manual_seed(2025) 2025-05-07T20:32:24.5941631Z 2025-05-07T20:32:24.5941920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5942263Z 2025-05-07T20:32:24.5942462Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5942764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5943071Z x = x_sign * x_clamp 2025-05-07T20:32:24.5943322Z x0 = x[:, :D] 2025-05-07T20:32:24.5943549Z x1 = x[:, D:] 2025-05-07T20:32:24.5943754Z 2025-05-07T20:32:24.5943947Z if contiguous: 2025-05-07T20:32:24.5944186Z x0 = x0.contiguous() 2025-05-07T20:32:24.5944440Z x1 = x1.contiguous() 2025-05-07T20:32:24.5944682Z 2025-05-07T20:32:24.5945104Z if scale_ub is not None: 2025-05-07T20:32:24.5945394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5945739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5946058Z ) 2025-05-07T20:32:24.5946271Z else: 2025-05-07T20:32:24.5946497Z scale_ub_tensor = None 2025-05-07T20:32:24.5946778Z 2025-05-07T20:32:24.5947028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5947349Z op = silu_mul_quant 2025-05-07T20:32:24.5947618Z if compiled: 2025-05-07T20:32:24.5947883Z op = torch.compile(op) 2025-05-07T20:32:24.5948187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5948479Z 2025-05-07T20:32:24.5948687Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.5948857Z 2025-05-07T20:32:24.5948966Z moe/activation_test.py:117: 2025-05-07T20:32:24.5949277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5949639Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.5949938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5950630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.5951343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.5951891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:24.5952575Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5953250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5953793Z kernel = self.compile( 2025-05-07T20:32:24.5954346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5955005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5955417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5955650Z 2025-05-07T20:32:24.5955868Z self = 2025-05-07T20:32:24.5957056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5958437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09699260>} 2025-05-07T20:32:24.5959784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5960829Z context = 2025-05-07T20:32:24.5961118Z 2025-05-07T20:32:24.5961300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5961820Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5962306Z module_map=module_map) 2025-05-07T20:32:24.5962699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5963072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.5963339Z E ^ 2025-05-07T20:32:24.5963822Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5964275Z 2025-05-07T20:32:24.5964836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5965347Z 2025-05-07T20:32:24.5965469Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5967300Z self=, 2025-05-07T20:32:24.5967714Z T=128, 2025-05-07T20:32:24.5967907Z D=7168, 2025-05-07T20:32:24.5968115Z scale_ub=None, 2025-05-07T20:32:24.5968330Z contiguous=False, 2025-05-07T20:32:24.5968585Z compiled=True, 2025-05-07T20:32:24.5968803Z ) 2025-05-07T20:32:24.5969129Z self = 2025-05-07T20:32:24.5969634Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.5969900Z 2025-05-07T20:32:24.5969995Z @given( 2025-05-07T20:32:24.5970225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5970550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5970871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5971200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5971550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5971852Z ) 2025-05-07T20:32:24.5972215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5972665Z def test_silu_mul_quant( 2025-05-07T20:32:24.5972921Z self, 2025-05-07T20:32:24.5973137Z T: int, 2025-05-07T20:32:24.5973334Z D: int, 2025-05-07T20:32:24.5973575Z scale_ub: Optional[float], 2025-05-07T20:32:24.5973870Z contiguous: bool, 2025-05-07T20:32:24.5974112Z compiled: bool, 2025-05-07T20:32:24.5974352Z ) -> None: 2025-05-07T20:32:24.5974588Z torch.manual_seed(2025) 2025-05-07T20:32:24.5974836Z 2025-05-07T20:32:24.5975130Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5975487Z 2025-05-07T20:32:24.5975691Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5976010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5976377Z x = x_sign * x_clamp 2025-05-07T20:32:24.5976627Z x0 = x[:, :D] 2025-05-07T20:32:24.5976866Z x1 = x[:, D:] 2025-05-07T20:32:24.5977085Z 2025-05-07T20:32:24.5977290Z if contiguous: 2025-05-07T20:32:24.5977526Z x0 = x0.contiguous() 2025-05-07T20:32:24.5977800Z x1 = x1.contiguous() 2025-05-07T20:32:24.5978151Z 2025-05-07T20:32:24.5978349Z if scale_ub is not None: 2025-05-07T20:32:24.5978638Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5978985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5979290Z ) 2025-05-07T20:32:24.5979496Z else: 2025-05-07T20:32:24.5979710Z scale_ub_tensor = None 2025-05-07T20:32:24.5979952Z 2025-05-07T20:32:24.5980188Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5980512Z op = silu_mul_quant 2025-05-07T20:32:24.5980753Z if compiled: 2025-05-07T20:32:24.5981002Z op = torch.compile(op) 2025-05-07T20:32:24.5981308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5981578Z 2025-05-07T20:32:24.5981781Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.5982075Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.5982366Z 2025-05-07T20:32:24.5982606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5982943Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.5983236Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.5983540Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.5983908Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5984221Z 2025-05-07T20:32:24.5984422Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.5984625Z 2025-05-07T20:32:24.5984722Z moe/activation_test.py:126: 2025-05-07T20:32:24.5985028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5985466Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.5985790Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5986625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.5987390Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.5987933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5988623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5989317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.5990045Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.5990768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.5991417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.5992023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.5992547Z fn() 2025-05-07T20:32:24.5993050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.5993639Z self.fn.run( 2025-05-07T20:32:24.5994106Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5994624Z kernel = self.compile( 2025-05-07T20:32:24.5995167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5995821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5996245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5996505Z 2025-05-07T20:32:24.5996710Z self = 2025-05-07T20:32:24.5997787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5999253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09699d00>} 2025-05-07T20:32:24.6000597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.6001613Z context = 2025-05-07T20:32:24.6001910Z 2025-05-07T20:32:24.6002083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.6002609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6003086Z module_map=module_map) 2025-05-07T20:32:24.6003440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6003800Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.6004068Z E ^ 2025-05-07T20:32:24.6004602Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6005058Z 2025-05-07T20:32:24.6005467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8451727Z 2025-05-07T20:32:24.8452177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8452828Z self=, 2025-05-07T20:32:24.8453491Z T=128, 2025-05-07T20:32:24.8453994Z D=7168, 2025-05-07T20:32:24.8454266Z scale_ub=None, 2025-05-07T20:32:24.8454557Z contiguous=False, 2025-05-07T20:32:24.8454863Z compiled=False, 2025-05-07T20:32:24.8455134Z ) 2025-05-07T20:32:24.8455575Z self = 2025-05-07T20:32:24.8456230Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.8456536Z 2025-05-07T20:32:24.8456607Z @given( 2025-05-07T20:32:24.8456832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8457136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8457442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8457762Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8458081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8458355Z ) 2025-05-07T20:32:24.8458696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8459132Z def test_silu_mul_quant( 2025-05-07T20:32:24.8459366Z self, 2025-05-07T20:32:24.8459543Z T: int, 2025-05-07T20:32:24.8459729Z D: int, 2025-05-07T20:32:24.8459940Z scale_ub: Optional[float], 2025-05-07T20:32:24.8460202Z contiguous: bool, 2025-05-07T20:32:24.8460437Z compiled: bool, 2025-05-07T20:32:24.8460655Z ) -> None: 2025-05-07T20:32:24.8460857Z torch.manual_seed(2025) 2025-05-07T20:32:24.8461092Z 2025-05-07T20:32:24.8461364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8461691Z 2025-05-07T20:32:24.8461878Z x_sign = torch.sign(x) 
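    # Together with the sign captured above, the clamp below bounds every input
    # magnitude to [0.01, 2.0] while preserving sign, keeping the row-wise FP8
    # scales away from zeros and extreme outliers for the later comparison.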
2025-05-07T20:32:24.8462166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8462472Z x = x_sign * x_clamp 2025-05-07T20:32:24.8462694Z x0 = x[:, :D] 2025-05-07T20:32:24.8462899Z x1 = x[:, D:] 2025-05-07T20:32:24.8463092Z 2025-05-07T20:32:24.8463260Z if contiguous: 2025-05-07T20:32:24.8463483Z x0 = x0.contiguous() 2025-05-07T20:32:24.8463727Z x1 = x1.contiguous() 2025-05-07T20:32:24.8463951Z 2025-05-07T20:32:24.8464131Z if scale_ub is not None: 2025-05-07T20:32:24.8464394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8464861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8465160Z ) 2025-05-07T20:32:24.8465345Z else: 2025-05-07T20:32:24.8465537Z scale_ub_tensor = None 2025-05-07T20:32:24.8465781Z 2025-05-07T20:32:24.8466001Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8466299Z op = silu_mul_quant 2025-05-07T20:32:24.8466543Z if compiled: 2025-05-07T20:32:24.8466783Z op = torch.compile(op) 2025-05-07T20:32:24.8467069Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8467327Z 2025-05-07T20:32:24.8467507Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8467668Z 2025-05-07T20:32:24.8467771Z moe/activation_test.py:117: 2025-05-07T20:32:24.8468053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8468378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8468651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8469343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8470021Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8470551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8471229Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8472089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8472614Z kernel = self.compile( 2025-05-07T20:32:24.8473241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8473886Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8474278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8474518Z 2025-05-07T20:32:24.8474721Z self = 2025-05-07T20:32:24.8475793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8477252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7faae123e700>} 2025-05-07T20:32:24.8478597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8479609Z context = 2025-05-07T20:32:24.8479908Z 2025-05-07T20:32:24.8480068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8480585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8481048Z module_map=module_map) 2025-05-07T20:32:24.8481399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8481748Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8482005Z E ^ 2025-05-07T20:32:24.8482462Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8482907Z 2025-05-07T20:32:24.8483331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8483836Z 2025-05-07T20:32:24.8483938Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8484337Z self=, 2025-05-07T20:32:24.8484942Z T=4096, 2025-05-07T20:32:24.8485124Z D=5120, 2025-05-07T20:32:24.8485302Z scale_ub=1200.0, 2025-05-07T20:32:24.8485522Z contiguous=True, 2025-05-07T20:32:24.8485736Z compiled=False, 2025-05-07T20:32:24.8485923Z ) 2025-05-07T20:32:24.8486239Z self = 2025-05-07T20:32:24.8486732Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8486998Z 2025-05-07T20:32:24.8487068Z @given( 2025-05-07T20:32:24.8487289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8487596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8487905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8488217Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8488538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8488818Z ) 2025-05-07T20:32:24.8489152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8489594Z def test_silu_mul_quant( 2025-05-07T20:32:24.8489827Z self, 2025-05-07T20:32:24.8490003Z T: int, 2025-05-07T20:32:24.8490192Z D: int, 2025-05-07T20:32:24.8490400Z scale_ub: Optional[float], 2025-05-07T20:32:24.8490655Z contiguous: bool, 2025-05-07T20:32:24.8490886Z compiled: bool, 2025-05-07T20:32:24.8491100Z ) -> None: 2025-05-07T20:32:24.8491296Z torch.manual_seed(2025) 2025-05-07T20:32:24.8491525Z 2025-05-07T20:32:24.8491784Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8492116Z 2025-05-07T20:32:24.8492375Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8492661Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8492958Z x = x_sign * x_clamp 2025-05-07T20:32:24.8493180Z x0 = x[:, :D] 2025-05-07T20:32:24.8493384Z x1 = x[:, D:] 2025-05-07T20:32:24.8493585Z 2025-05-07T20:32:24.8493757Z if contiguous: 2025-05-07T20:32:24.8493986Z x0 = x0.contiguous() 2025-05-07T20:32:24.8494232Z x1 = x1.contiguous() 2025-05-07T20:32:24.8494453Z 2025-05-07T20:32:24.8494629Z if scale_ub is not None: 2025-05-07T20:32:24.8494890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8495208Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8495506Z ) 2025-05-07T20:32:24.8495687Z else: 2025-05-07T20:32:24.8495880Z scale_ub_tensor = None 2025-05-07T20:32:24.8496120Z 2025-05-07T20:32:24.8496341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8496642Z op = silu_mul_quant 2025-05-07T20:32:24.8496884Z if compiled: 
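    # With compiled=True, torch.compile(op) sends the Triton kernel launch through
    # Dynamo's triton_kernel_wrap higher-order op; when TTIR generation fails (as
    # it does here for fp8e4nv), identify_mutated_tensors logs the warnings seen
    # above and conservatively assumes every input to the kernel is mutated.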
2025-05-07T20:32:24.8497121Z op = torch.compile(op) 2025-05-07T20:32:24.8497410Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8497668Z 2025-05-07T20:32:24.8497853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.8498011Z 2025-05-07T20:32:24.8498115Z moe/activation_test.py:117: 2025-05-07T20:32:24.8498396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8498719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.8498991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8499662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.8500342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.8500876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8501553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8502205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8502813Z kernel = self.compile( 2025-05-07T20:32:24.8503349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8503997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8504381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8504613Z 2025-05-07T20:32:24.8504815Z self = 2025-05-07T20:32:24.8505893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8507256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123c400>} 2025-05-07T20:32:24.8508746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8509764Z context = 2025-05-07T20:32:24.8510058Z 2025-05-07T20:32:24.8510219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8510736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8511190Z module_map=module_map) 2025-05-07T20:32:24.8511549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8512025Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.8512275Z E ^ 2025-05-07T20:32:24.8512776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8513239Z 2025-05-07T20:32:24.8513650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.8514159Z 2025-05-07T20:32:24.8514267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8514665Z self=, 2025-05-07T20:32:24.8515084Z T=1, 2025-05-07T20:32:24.8515275Z D=5120, 2025-05-07T20:32:24.8515462Z scale_ub=None, 2025-05-07T20:32:24.8515663Z contiguous=True, 2025-05-07T20:32:24.8515887Z compiled=True, 2025-05-07T20:32:24.8516090Z ) 2025-05-07T20:32:24.8516393Z self = 2025-05-07T20:32:24.8516875Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.8517137Z 2025-05-07T20:32:24.8517207Z @given( 2025-05-07T20:32:24.8517435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8517735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8518084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8518427Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8518737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8519013Z ) 2025-05-07T20:32:24.8519357Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8519787Z def test_silu_mul_quant( 2025-05-07T20:32:24.8520027Z self, 2025-05-07T20:32:24.8520224Z T: int, 2025-05-07T20:32:24.8520421Z D: int, 2025-05-07T20:32:24.8520635Z scale_ub: Optional[float], 2025-05-07T20:32:24.8520910Z contiguous: bool, 2025-05-07T20:32:24.8521159Z compiled: bool, 2025-05-07T20:32:24.8521372Z ) -> None: 2025-05-07T20:32:24.8521594Z torch.manual_seed(2025) 2025-05-07T20:32:24.8521830Z 2025-05-07T20:32:24.8522094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8522604Z 2025-05-07T20:32:24.8522795Z x_sign = torch.sign(x) 2025-05-07T20:32:24.8523078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.8523391Z x = x_sign * x_clamp 2025-05-07T20:32:24.8523641Z x0 = x[:, :D] 2025-05-07T20:32:24.8523850Z x1 = x[:, D:] 2025-05-07T20:32:24.8524060Z 2025-05-07T20:32:24.8524246Z if contiguous: 2025-05-07T20:32:24.8524533Z x0 = x0.contiguous() 2025-05-07T20:32:24.8524786Z x1 = x1.contiguous() 2025-05-07T20:32:24.8525015Z 2025-05-07T20:32:24.8525242Z if scale_ub is not None: 2025-05-07T20:32:24.8525623Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.8525977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.8526271Z ) 2025-05-07T20:32:24.8526450Z else: 2025-05-07T20:32:24.8526650Z scale_ub_tensor = None 2025-05-07T20:32:24.8526945Z 2025-05-07T20:32:24.8527192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8527499Z op = silu_mul_quant 2025-05-07T20:32:24.8527731Z if compiled: 2025-05-07T20:32:24.8527969Z op = torch.compile(op) 2025-05-07T20:32:24.8528254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.8528511Z 2025-05-07T20:32:24.8528696Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.8528974Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.8529249Z 2025-05-07T20:32:24.8537497Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.8537862Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.8538157Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.8538590Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.8538943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.8539255Z 2025-05-07T20:32:24.8539456Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:24.8539660Z 2025-05-07T20:32:24.8539762Z moe/activation_test.py:126: 2025-05-07T20:32:24.8540065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8540401Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.8540733Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.8541508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.8542254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.8542800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.8543481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.8544168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.8544896Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.8545620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.8546248Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.8546842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.8547350Z fn() 2025-05-07T20:32:24.8547856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.8548432Z self.fn.run( 2025-05-07T20:32:24.8548902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.8549427Z kernel = self.compile( 2025-05-07T20:32:24.8549958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.8550699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.8551091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.8551319Z 2025-05-07T20:32:24.8551534Z self = 2025-05-07T20:32:24.8552601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.8553982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123ef20>} 2025-05-07T20:32:24.8555319Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.8556344Z context = 2025-05-07T20:32:24.8556630Z 2025-05-07T20:32:24.8556792Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.8557313Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.8557772Z module_map=module_map) 2025-05-07T20:32:24.8558132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.8558473Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.8558731Z E ^ 2025-05-07T20:32:24.8559280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.8559727Z 2025-05-07T20:32:24.8560142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.0999842Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.1000934Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:25.1002285Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.1004928Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.1005915Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1007249Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.1008956Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1010288Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.1011673Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1012908Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:25.1014175Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.1015423Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:25.1016264Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:25.1017463Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.1018669Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:25.1019692Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:25.1020709Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:25.1022027Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.1023303Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.1024201Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:25.1025280Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:25.1026312Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:25.1027067Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:25.1028232Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.1029593Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.1030658Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1031552Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1032284Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:25.1033311Z W0507 20:32:25.096000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5168572Z 2025-05-07T20:32:26.5168900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5169647Z self=, 2025-05-07T20:32:26.5170298Z T=2048, 2025-05-07T20:32:26.5170600Z D=5120, 2025-05-07T20:32:26.5170787Z scale_ub=None, 2025-05-07T20:32:26.5170996Z contiguous=True, 2025-05-07T20:32:26.5171216Z compiled=True, 2025-05-07T20:32:26.5171421Z ) 2025-05-07T20:32:26.5171748Z self = 2025-05-07T20:32:26.5172247Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5172520Z 2025-05-07T20:32:26.5172593Z @given( 2025-05-07T20:32:26.5172819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5173125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5173432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5173760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5174082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5174367Z ) 2025-05-07T20:32:26.5174712Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5175161Z def test_silu_mul_quant( 2025-05-07T20:32:26.5175397Z self, 2025-05-07T20:32:26.5175596Z T: int, 2025-05-07T20:32:26.5175800Z D: int, 2025-05-07T20:32:26.5176008Z scale_ub: Optional[float], 2025-05-07T20:32:26.5176450Z contiguous: bool, 2025-05-07T20:32:26.5176706Z compiled: bool, 2025-05-07T20:32:26.5176965Z ) -> None: 2025-05-07T20:32:26.5177192Z torch.manual_seed(2025) 2025-05-07T20:32:26.5177443Z 2025-05-07T20:32:26.5177713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5178069Z 2025-05-07T20:32:26.5178265Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5178552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.5178874Z x = x_sign * x_clamp 2025-05-07T20:32:26.5179123Z x0 = x[:, :D] 2025-05-07T20:32:26.5179336Z x1 = x[:, D:] 2025-05-07T20:32:26.5179547Z 2025-05-07T20:32:26.5179738Z if contiguous: 2025-05-07T20:32:26.5179963Z x0 = x0.contiguous() 2025-05-07T20:32:26.5180223Z x1 = x1.contiguous() 2025-05-07T20:32:26.5180467Z 2025-05-07T20:32:26.5180647Z if scale_ub is not None: 2025-05-07T20:32:26.5180931Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5181262Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5181565Z ) 2025-05-07T20:32:26.5181746Z else: 2025-05-07T20:32:26.5181951Z scale_ub_tensor = None 2025-05-07T20:32:26.5182203Z 2025-05-07T20:32:26.5182421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5182733Z op = silu_mul_quant 2025-05-07T20:32:26.5182976Z if compiled: 2025-05-07T20:32:26.5183212Z op = torch.compile(op) 2025-05-07T20:32:26.5183518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5183783Z 2025-05-07T20:32:26.5183963Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5184248Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5184534Z 2025-05-07T20:32:26.5184759Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5185096Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5185390Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5185695Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5186049Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5186362Z 2025-05-07T20:32:26.5186691Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:26.5186882Z 2025-05-07T20:32:26.5186977Z moe/activation_test.py:126: 2025-05-07T20:32:26.5187279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5197180Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5197560Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5198357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5199123Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5199684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5200374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5201065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5201805Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5202548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5203196Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5203800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5204436Z fn() 2025-05-07T20:32:26.5204959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5205654Z self.fn.run( 2025-05-07T20:32:26.5206142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5206683Z kernel = self.compile( 2025-05-07T20:32:26.5207243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5207918Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5208668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5208908Z 2025-05-07T20:32:26.5209131Z self = 2025-05-07T20:32:26.5210237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5211625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae123f9c0>} 2025-05-07T20:32:26.5212988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5214031Z context = 2025-05-07T20:32:26.5214380Z 2025-05-07T20:32:26.5214590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5215109Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5215584Z module_map=module_map) 2025-05-07T20:32:26.5215960Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5216324Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5216601Z E ^ 2025-05-07T20:32:26.5217072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5217522Z 2025-05-07T20:32:26.5217946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.5218618Z 2025-05-07T20:32:26.5218736Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:26.5219148Z self=, 2025-05-07T20:32:26.5219561Z T=128, 2025-05-07T20:32:26.5219765Z D=5120, 2025-05-07T20:32:26.5219962Z scale_ub=None, 2025-05-07T20:32:26.5220186Z contiguous=True, 2025-05-07T20:32:26.5220417Z compiled=True, 2025-05-07T20:32:26.5220617Z ) 2025-05-07T20:32:26.5220944Z self = 2025-05-07T20:32:26.5221437Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:26.5221703Z 2025-05-07T20:32:26.5221780Z @given( 2025-05-07T20:32:26.5222021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:26.5222350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:26.5222663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:26.5223000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:26.5223341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:26.5223635Z ) 2025-05-07T20:32:26.5224097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:26.5224714Z def test_silu_mul_quant( 2025-05-07T20:32:26.5225057Z self, 2025-05-07T20:32:26.5225272Z T: int, 2025-05-07T20:32:26.5225481Z D: int, 2025-05-07T20:32:26.5225718Z scale_ub: Optional[float], 2025-05-07T20:32:26.5225995Z contiguous: bool, 2025-05-07T20:32:26.5226253Z compiled: bool, 2025-05-07T20:32:26.5226492Z ) -> None: 2025-05-07T20:32:26.5226861Z torch.manual_seed(2025) 2025-05-07T20:32:26.5227119Z 2025-05-07T20:32:26.5227405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:26.5227752Z 2025-05-07T20:32:26.5227959Z x_sign = torch.sign(x) 2025-05-07T20:32:26.5228265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:26.5228575Z x = x_sign * x_clamp 2025-05-07T20:32:26.5228808Z x0 = x[:, :D] 2025-05-07T20:32:26.5229016Z x1 = x[:, D:] 2025-05-07T20:32:26.5229220Z 2025-05-07T20:32:26.5229393Z if contiguous: 2025-05-07T20:32:26.5229613Z x0 = x0.contiguous() 2025-05-07T20:32:26.5229862Z x1 = x1.contiguous() 2025-05-07T20:32:26.5230090Z 2025-05-07T20:32:26.5230273Z if scale_ub is not None: 2025-05-07T20:32:26.5230537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:26.5230867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:26.5231176Z ) 2025-05-07T20:32:26.5231377Z else: 2025-05-07T20:32:26.5231591Z scale_ub_tensor = None 2025-05-07T20:32:26.5231838Z 2025-05-07T20:32:26.5232073Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5232392Z op = silu_mul_quant 2025-05-07T20:32:26.5232642Z if compiled: 2025-05-07T20:32:26.5232895Z op = torch.compile(op) 2025-05-07T20:32:26.5233197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:26.5233466Z 2025-05-07T20:32:26.5233663Z y_fp8, y_scale = fn() 2025-05-07T20:32:26.5233960Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:26.5234264Z 2025-05-07T20:32:26.5234505Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:26.5234855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:26.5235164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:26.5235484Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:26.5235866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5236201Z 2025-05-07T20:32:26.5236409Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:26.5236624Z 2025-05-07T20:32:26.5236727Z moe/activation_test.py:126: 2025-05-07T20:32:26.5237131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5237472Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:26.5237811Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:26.5238611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:26.5239375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:26.5239925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:26.5240626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:26.5241328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:26.5242060Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:26.5242799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:26.5243464Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:26.5244083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:26.5244732Z fn() 2025-05-07T20:32:26.5245253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:26.5245845Z self.fn.run( 2025-05-07T20:32:26.5246409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:26.5246943Z kernel = self.compile( 2025-05-07T20:32:26.5247578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:26.5248301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.5248711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:26.5248953Z 2025-05-07T20:32:26.5249169Z self = 2025-05-07T20:32:26.5250266Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:26.5251657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0aaaac0>} 2025-05-07T20:32:26.5253012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:26.5254045Z context = 2025-05-07T20:32:26.5254346Z 2025-05-07T20:32:26.5254518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:26.5255055Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.5255529Z module_map=module_map) 2025-05-07T20:32:26.5255902Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.5256272Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:26.5256550Z E ^ 2025-05-07T20:32:26.5257025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.5257482Z 2025-05-07T20:32:26.5257901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:26.7588239Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:26.7589712Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:26.7591052Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:26.7592477Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:26.7593444Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:26.7594745Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:26.7596124Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:26.7597473Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:26.7598981Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:26.7600022Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:26.7601294Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:26.7602542Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:26.7603382Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:26.7604747Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:26.7605955Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:26.7607060Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:26.7608117Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:26.7609519Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:26.7610799Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:26.7611711Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:26.7612972Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:26.7614016Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:26.7614795Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:26.7615966Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:26.7617566Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:26.7618790Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:26.7619704Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:26.7620431Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:26.7621442Z W0507 20:32:26.754000 87999 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
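Note on the warning above: it comes from torch.compile rather than from the test body. While tracing the compiled silu_mul_quant op, Dynamo lowers the user-defined Triton kernel _fbgemm_silu_mul_quant to TTIR (generate_ttir -> make_ir) to work out which arguments the kernel mutates; the same fp8e4nv ValueError aborts that analysis, so identify_mutated_tensors falls back to conservatively assuming every input is mutated. A minimal probe of the underlying capability, as a hedged sketch: the SM >= 8.9 threshold for fp8e4nv is an assumption consistent with the error text, not something this log states.

import torch

def has_fp8e4nv_support() -> bool:
    # Assumption: Triton's fp8e4nv (FP8 E4M3) dtype compiles only on NVIDIA
    # GPUs with compute capability >= 8.9; older parts expose just fp8e4b15
    # and fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)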
2025-05-07T20:32:27.9773843Z 2025-05-07T20:32:27.9774045Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.9774491Z self=, 2025-05-07T20:32:27.9775160Z T=4096, 2025-05-07T20:32:27.9775345Z D=5120, 2025-05-07T20:32:27.9775523Z scale_ub=None, 2025-05-07T20:32:27.9775732Z contiguous=True, 2025-05-07T20:32:27.9775950Z compiled=True, 2025-05-07T20:32:27.9776140Z ) 2025-05-07T20:32:27.9776452Z self = 2025-05-07T20:32:27.9776950Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:27.9777212Z 2025-05-07T20:32:27.9777283Z @given( 2025-05-07T20:32:27.9777510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.9777828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.9778126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.9778448Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.9778771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.9779049Z ) 2025-05-07T20:32:27.9779382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.9779820Z def test_silu_mul_quant( 2025-05-07T20:32:27.9780059Z self, 2025-05-07T20:32:27.9780241Z T: int, 2025-05-07T20:32:27.9780437Z D: int, 2025-05-07T20:32:27.9780649Z scale_ub: Optional[float], 2025-05-07T20:32:27.9780907Z contiguous: bool, 2025-05-07T20:32:27.9781147Z compiled: bool, 2025-05-07T20:32:27.9781363Z ) -> None: 2025-05-07T20:32:27.9781567Z torch.manual_seed(2025) 2025-05-07T20:32:27.9781800Z 2025-05-07T20:32:27.9782072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.9782627Z 2025-05-07T20:32:27.9782814Z x_sign = torch.sign(x) 2025-05-07T20:32:27.9783101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.9783398Z x = x_sign * x_clamp 2025-05-07T20:32:27.9783650Z x0 = x[:, :D] 2025-05-07T20:32:27.9783885Z x1 = x[:, D:] 2025-05-07T20:32:27.9784088Z 2025-05-07T20:32:27.9784261Z if contiguous: 2025-05-07T20:32:27.9784490Z x0 = x0.contiguous() 2025-05-07T20:32:27.9784746Z x1 = x1.contiguous() 2025-05-07T20:32:27.9784970Z 2025-05-07T20:32:27.9785158Z if scale_ub is not None: 2025-05-07T20:32:27.9785422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.9785754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.9786057Z ) 2025-05-07T20:32:27.9786243Z else: 2025-05-07T20:32:27.9786440Z scale_ub_tensor = None 2025-05-07T20:32:27.9786688Z 2025-05-07T20:32:27.9786916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.9787219Z op = silu_mul_quant 2025-05-07T20:32:27.9787460Z if compiled: 2025-05-07T20:32:27.9787706Z op = torch.compile(op) 2025-05-07T20:32:27.9787996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.9788259Z 2025-05-07T20:32:27.9788446Z y_fp8, y_scale = fn() 2025-05-07T20:32:27.9788724Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:27.9788998Z 2025-05-07T20:32:27.9789224Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.9789546Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:27.9789828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:27.9790255Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:27.9790612Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.9790914Z 2025-05-07T20:32:27.9791098Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:27.9791299Z 2025-05-07T20:32:27.9791395Z moe/activation_test.py:126: 2025-05-07T20:32:27.9791689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.9792015Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:27.9792336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:27.9793117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:27.9793857Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:27.9794389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.9795070Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.9795744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:27.9796461Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:27.9797182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:27.9797814Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:27.9798403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:27.9798904Z fn() 2025-05-07T20:32:27.9799402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:27.9799972Z self.fn.run( 2025-05-07T20:32:27.9800434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.9800947Z kernel = self.compile( 2025-05-07T20:32:27.9801483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.9802218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.9802603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.9802834Z 2025-05-07T20:32:27.9803038Z self = 2025-05-07T20:32:27.9804115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.9805674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0b0a3e0>} 2025-05-07T20:32:27.9807006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.9808026Z context = 2025-05-07T20:32:27.9808477Z 2025-05-07T20:32:27.9808642Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.9809155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.9809611Z module_map=module_map) 2025-05-07T20:32:27.9809964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.9810311Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:27.9810566Z E ^ 2025-05-07T20:32:27.9811147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:27.9812008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:27.9812619Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): fails identically in ref_fn (moe/activation_test.py:126) -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
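Note: every example in this run fails the same way. The job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), and Triton only lowers the fp8e4nv (FP8 e4m3) dtype on compute capability 8.9 or newer, which is exactly what the ValueError reports. A minimal sketch of a capability gate the test could apply; the helper and class names here are illustrative, not taken from the FBGEMM test suite:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) lowering needs sm_89+ (Ada/Hopper);
        # on older parts it raises the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard around the test class (real name not shown in this log).
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "FP8 e4m3 unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...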
2025-05-07T20:32:28.0082215Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:28.0083796Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:28.0085561Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:28.0086802Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:28.0088204Z W0507 20:32:28.007000 87999 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:28.2281600Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
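The recompile_limit warning above is a side effect of the Hypothesis sweep rather than a separate failure: x0 = x[:, :D] keeps the parent's row stride of 2*D, while .contiguous() re-packs it to D, so each layout trips a new stride guard until dynamo stops recompiling. A small sketch reproducing the two strides named in the warning (illustrative, not test code):

    import torch
    import torch._dynamo

    T, D = 128, 5120
    x = torch.randn(T, 2 * D)

    x0_view = x[:, :D]                 # stride (10240, 1): the "actual 10240"
    x0_packed = x[:, :D].contiguous()  # stride (5120, 1): the "expected 5120"
    assert x0_view.stride(0) == 2 * D
    assert x0_packed.stride(0) == D

    # Raising the limit named in the warning would let both layouts compile:
    torch._dynamo.config.recompile_limit = 16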
2025-05-07T20:32:28.2314399Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails identically in ref_fn (moe/activation_test.py:126) -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
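For reference, the check y = y_fp8.to(torch.float32) * y_scale[:, None] in the test body implies a per-row scale contract for triton_quantize_fp8_row. A plain-PyTorch sketch of that contract, assuming max-abs row scaling with an optional scale upper bound; this is inferred from the test, not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap the per-row scale
        scale = row_max / fp8_max                       # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round-trip matches the test's check: y ~= y_fp8.float() * scale[:, None]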
2025-05-07T20:32:28.3771757Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
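The error message itself lists what this sm_86 part can lower ('fp8e4b15', 'fp8e5'), so another option is to pick the FP8 format from the device capability instead of hard-coding e4m3. The torch-to-Triton dtype mapping below is the standard one, but the fallback policy is an illustrative assumption, not FBGEMM's behavior:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn lowers as Triton's fp8e4nv (sm_89+ only);
        # torch.float8_e5m2 lowers as fp8e5, which this GPU does support.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2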
2025-05-07T20:32:28.3803231Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.3835068Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.5452352Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.5483023Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7076784Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): fails identically in fn (moe/activation_test.py:117) -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture.
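Since hypothesis picks its examples from the sampled strategies at run time, any of the failing tuples above can be pinned for deterministic reproduction with an @example decorator stacked on the existing @given. A sketch, with the tuple copied from the failing example above:

    from hypothesis import example, given, strategies as st

    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # test body unchanged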
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... same @given parameters and test source as above, through the definition of fn() ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7bce0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
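Unlike the previous examples, this one fails only after fn() has returned: the reference path triton_quantize_fp8_row launches the Triton kernel _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so it hits the same architecture limit. For hardware without fp8e4nv support, a pure-PyTorch rowwise quantizer matching the test's dequantization convention (y = y_fp8.to(torch.float32) * y_scale[:, None]) could stand in as a fallback. A sketch, assuming torch.float8_e4m3fn storage casts are available (PyTorch 2.1+) and using illustrative names, not the fbgemm_gpu implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_eager(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absmax scaling; dequant is xq.float() * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp(min=1e-12) / fp8_max      # avoid divide-by-zero
        xq = (x.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return xq.to(torch.float8_e4m3fn), scale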
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2378993Z 2025-05-07T20:32:29.2379426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.2379934Z 2025-05-07T20:32:29.2380052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.2380477Z self=, 2025-05-07T20:32:29.2380892Z T=1, 2025-05-07T20:32:29.2381088Z D=5120, 2025-05-07T20:32:29.2381288Z scale_ub=1200.0, 2025-05-07T20:32:29.2381527Z contiguous=False, 2025-05-07T20:32:29.2381771Z compiled=True, 2025-05-07T20:32:29.2381987Z ) 2025-05-07T20:32:29.2382316Z self = 2025-05-07T20:32:29.2382821Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.2383083Z 2025-05-07T20:32:29.2383164Z @given( 2025-05-07T20:32:29.2383414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.2383739Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.2384049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.2384389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.2384736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.2385037Z ) 2025-05-07T20:32:29.2385384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.2385835Z def test_silu_mul_quant( 2025-05-07T20:32:29.2386085Z self, 2025-05-07T20:32:29.2386274Z T: int, 2025-05-07T20:32:29.2386470Z D: int, 2025-05-07T20:32:29.2386689Z scale_ub: Optional[float], 2025-05-07T20:32:29.2386954Z contiguous: bool, 2025-05-07T20:32:29.2387201Z compiled: bool, 2025-05-07T20:32:29.2387435Z ) -> None: 2025-05-07T20:32:29.2387645Z torch.manual_seed(2025) 2025-05-07T20:32:29.2387894Z 2025-05-07T20:32:29.2388181Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.2388518Z 2025-05-07T20:32:29.2388710Z x_sign = torch.sign(x) 2025-05-07T20:32:29.2389009Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.2389412Z x = x_sign * x_clamp 2025-05-07T20:32:29.2389644Z x0 = x[:, :D] 2025-05-07T20:32:29.2389863Z x1 = x[:, D:] 2025-05-07T20:32:29.2390080Z 2025-05-07T20:32:29.2390259Z if contiguous: 2025-05-07T20:32:29.2390492Z x0 = x0.contiguous() 2025-05-07T20:32:29.2396853Z x1 = x1.contiguous() 2025-05-07T20:32:29.2397150Z 2025-05-07T20:32:29.2397357Z if scale_ub is not None: 2025-05-07T20:32:29.2397636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.2397988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.2398301Z ) 2025-05-07T20:32:29.2398493Z else: 2025-05-07T20:32:29.2398722Z scale_ub_tensor = None 2025-05-07T20:32:29.2398981Z 2025-05-07T20:32:29.2399214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.2399537Z op = silu_mul_quant 2025-05-07T20:32:29.2399796Z if compiled: 2025-05-07T20:32:29.2400050Z op = torch.compile(op) 2025-05-07T20:32:29.2400356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2400637Z 2025-05-07T20:32:29.2400837Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.2401002Z 2025-05-07T20:32:29.2401106Z moe/activation_test.py:117: 2025-05-07T20:32:29.2401406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2401749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.2402029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.2402601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.2403166Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.2403936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.2404801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.2405347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.2406040Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.2406702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.2407245Z kernel = self.compile( 2025-05-07T20:32:29.2407805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.2408709Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.2409116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.2409356Z 2025-05-07T20:32:29.2409567Z self = 2025-05-07T20:32:29.2410663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.2412052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad35b3920>} 2025-05-07T20:32:29.2413417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.2414533Z context = 2025-05-07T20:32:29.2414828Z 2025-05-07T20:32:29.2414994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.2415515Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.2415978Z module_map=module_map) 2025-05-07T20:32:29.2416509Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.2416864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.2417107Z E ^ 2025-05-07T20:32:29.2417573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.2418027Z 2025-05-07T20:32:29.2418441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3821815Z 2025-05-07T20:32:29.3821997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3822545Z self=, 2025-05-07T20:32:29.3822958Z T=1, 2025-05-07T20:32:29.3823151Z D=5120, 2025-05-07T20:32:29.3823441Z scale_ub=1200.0, 2025-05-07T20:32:29.3823671Z contiguous=False, 2025-05-07T20:32:29.3824019Z compiled=False, 2025-05-07T20:32:29.3824330Z ) 2025-05-07T20:32:29.3824820Z self = 2025-05-07T20:32:29.3825492Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.3825753Z 2025-05-07T20:32:29.3825828Z @given( 2025-05-07T20:32:29.3826064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3826376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3826684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3827006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3827335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3827615Z ) 2025-05-07T20:32:29.3828131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3828576Z def test_silu_mul_quant( 2025-05-07T20:32:29.3828825Z self, 2025-05-07T20:32:29.3829016Z T: int, 2025-05-07T20:32:29.3829202Z D: int, 2025-05-07T20:32:29.3829422Z scale_ub: Optional[float], 2025-05-07T20:32:29.3829691Z contiguous: bool, 2025-05-07T20:32:29.3829917Z compiled: bool, 2025-05-07T20:32:29.3830131Z ) -> None: 2025-05-07T20:32:29.3830338Z torch.manual_seed(2025) 2025-05-07T20:32:29.3830564Z 2025-05-07T20:32:29.3830829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3831161Z 2025-05-07T20:32:29.3831342Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3831626Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3831921Z x = x_sign * x_clamp 2025-05-07T20:32:29.3832156Z x0 = x[:, :D] 2025-05-07T20:32:29.3832357Z x1 = x[:, D:] 2025-05-07T20:32:29.3832553Z 2025-05-07T20:32:29.3832729Z if contiguous: 2025-05-07T20:32:29.3832948Z x0 = x0.contiguous() 2025-05-07T20:32:29.3833199Z x1 = x1.contiguous() 2025-05-07T20:32:29.3833431Z 2025-05-07T20:32:29.3833604Z if scale_ub is not None: 2025-05-07T20:32:29.3833865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3834194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3834484Z ) 2025-05-07T20:32:29.3834662Z else: 2025-05-07T20:32:29.3834863Z scale_ub_tensor = None 2025-05-07T20:32:29.3835096Z 2025-05-07T20:32:29.3835319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3835616Z op = silu_mul_quant 2025-05-07T20:32:29.3835850Z if compiled: 2025-05-07T20:32:29.3836087Z op = torch.compile(op) 2025-05-07T20:32:29.3836373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3836627Z 2025-05-07T20:32:29.3836822Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3836985Z 2025-05-07T20:32:29.3837077Z moe/activation_test.py:117: 2025-05-07T20:32:29.3837362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3837679Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3838082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3838758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3839425Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3839944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3840614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3841263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3841779Z kernel = self.compile( 2025-05-07T20:32:29.3842321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3842970Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3843351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3843582Z 2025-05-07T20:32:29.3843785Z self = 2025-05-07T20:32:29.3844998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3846365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3bd74c0>} 2025-05-07T20:32:29.3847794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3848838Z context = 2025-05-07T20:32:29.3849131Z 2025-05-07T20:32:29.3849292Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3849801Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3850259Z module_map=module_map) 2025-05-07T20:32:29.3850609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3850949Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3851201Z E ^ 2025-05-07T20:32:29.3851644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3852089Z 2025-05-07T20:32:29.3852505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3853017Z 2025-05-07T20:32:29.3853114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3853513Z self=, 2025-05-07T20:32:29.3853905Z T=16384, 2025-05-07T20:32:29.3854090Z D=5120, 2025-05-07T20:32:29.3854272Z scale_ub=1200.0, 2025-05-07T20:32:29.3854479Z contiguous=False, 2025-05-07T20:32:29.3854694Z compiled=True, 2025-05-07T20:32:29.3854886Z ) 2025-05-07T20:32:29.3855190Z self = 2025-05-07T20:32:29.3855682Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.3855959Z 2025-05-07T20:32:29.3856030Z @given( 2025-05-07T20:32:29.3856250Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.3856547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.3856847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.3857168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.3857477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.3857770Z ) 2025-05-07T20:32:29.3858219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.3858640Z def test_silu_mul_quant( 2025-05-07T20:32:29.3858870Z self, 2025-05-07T20:32:29.3859059Z T: int, 2025-05-07T20:32:29.3859242Z D: int, 2025-05-07T20:32:29.3859451Z scale_ub: Optional[float], 2025-05-07T20:32:29.3859709Z contiguous: bool, 2025-05-07T20:32:29.3859938Z compiled: bool, 2025-05-07T20:32:29.3860145Z ) -> None: 2025-05-07T20:32:29.3860353Z torch.manual_seed(2025) 2025-05-07T20:32:29.3860584Z 2025-05-07T20:32:29.3860844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.3861177Z 2025-05-07T20:32:29.3861367Z x_sign = torch.sign(x) 2025-05-07T20:32:29.3861641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.3861945Z x = x_sign * x_clamp 2025-05-07T20:32:29.3862177Z x0 = x[:, :D] 2025-05-07T20:32:29.3862381Z x1 = x[:, D:] 2025-05-07T20:32:29.3862583Z 2025-05-07T20:32:29.3862756Z if contiguous: 2025-05-07T20:32:29.3862974Z x0 = x0.contiguous() 2025-05-07T20:32:29.3863225Z x1 = x1.contiguous() 2025-05-07T20:32:29.3863452Z 2025-05-07T20:32:29.3863627Z if scale_ub is not None: 2025-05-07T20:32:29.3863887Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.3864204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.3864504Z ) 2025-05-07T20:32:29.3864680Z else: 2025-05-07T20:32:29.3864880Z scale_ub_tensor = None 2025-05-07T20:32:29.3865116Z 2025-05-07T20:32:29.3865417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.3865722Z op = silu_mul_quant 2025-05-07T20:32:29.3865958Z if compiled: 2025-05-07T20:32:29.3866189Z op = torch.compile(op) 2025-05-07T20:32:29.3866476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3866748Z 2025-05-07T20:32:29.3866923Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.3867089Z 2025-05-07T20:32:29.3867181Z moe/activation_test.py:117: 2025-05-07T20:32:29.3867465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3867781Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.3868047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.3868596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.3869135Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.3869784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.3870459Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.3870980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.3871641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.3872295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.3872815Z kernel = self.compile( 2025-05-07T20:32:29.3873345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.3873998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.3874387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.3874611Z 2025-05-07T20:32:29.3874817Z self = 2025-05-07T20:32:29.3875889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.3877357Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3806660>} 2025-05-07T20:32:29.3878680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.3879695Z context = 2025-05-07T20:32:29.3879985Z 2025-05-07T20:32:29.3880145Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.3880659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.3881109Z module_map=module_map) 2025-05-07T20:32:29.3881468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.3881811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.3882065Z E ^ 2025-05-07T20:32:29.3882516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.3882964Z 2025-05-07T20:32:29.3883373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.3883877Z 2025-05-07T20:32:29.3883982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.3884457Z self=, 2025-05-07T20:32:29.3884849Z T=2048, 2025-05-07T20:32:29.3885023Z D=7168, 2025-05-07T20:32:29.3885197Z scale_ub=1200.0, 2025-05-07T20:32:29.3885493Z contiguous=False, 2025-05-07T20:32:29.3885709Z compiled=True, 2025-05-07T20:32:29.5758869Z ) 2025-05-07T20:32:29.5759716Z self = 2025-05-07T20:32:29.5760678Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.5761154Z 2025-05-07T20:32:29.5761273Z @given( 2025-05-07T20:32:29.5761622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5762097Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5762514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5762947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5763365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5763638Z ) 2025-05-07T20:32:29.5763973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5764524Z def test_silu_mul_quant( 2025-05-07T20:32:29.5764759Z self, 2025-05-07T20:32:29.5764942Z T: int, 2025-05-07T20:32:29.5765121Z D: int, 2025-05-07T20:32:29.5765324Z scale_ub: Optional[float], 2025-05-07T20:32:29.5765585Z contiguous: bool, 2025-05-07T20:32:29.5765808Z compiled: bool, 2025-05-07T20:32:29.5766021Z ) -> None: 2025-05-07T20:32:29.5766223Z torch.manual_seed(2025) 2025-05-07T20:32:29.5766457Z 2025-05-07T20:32:29.5766709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5767034Z 2025-05-07T20:32:29.5767213Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5767481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5767782Z x = x_sign * x_clamp 2025-05-07T20:32:29.5768010Z x0 = x[:, :D] 2025-05-07T20:32:29.5768208Z x1 = x[:, D:] 2025-05-07T20:32:29.5768405Z 2025-05-07T20:32:29.5768569Z if contiguous: 2025-05-07T20:32:29.5768782Z x0 = x0.contiguous() 2025-05-07T20:32:29.5769032Z x1 = x1.contiguous() 2025-05-07T20:32:29.5769255Z 2025-05-07T20:32:29.5769434Z if scale_ub is not None: 2025-05-07T20:32:29.5769699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5770025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5770541Z ) 2025-05-07T20:32:29.5770733Z else: 2025-05-07T20:32:29.5770938Z scale_ub_tensor = None 2025-05-07T20:32:29.5771186Z 2025-05-07T20:32:29.5771414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5771713Z op = silu_mul_quant 2025-05-07T20:32:29.5771959Z if compiled: 2025-05-07T20:32:29.5772206Z op = torch.compile(op) 2025-05-07T20:32:29.5772497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5772761Z 2025-05-07T20:32:29.5772952Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5773114Z 2025-05-07T20:32:29.5773217Z moe/activation_test.py:117: 2025-05-07T20:32:29.5773506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5773833Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5774104Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5774649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.5775204Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.5775849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5776527Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5777047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5777719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5778490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5779010Z kernel = self.compile( 2025-05-07T20:32:29.5779537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5780179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5780570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5780792Z 2025-05-07T20:32:29.5780996Z self = 2025-05-07T20:32:29.5782068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5783487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae124b060>} 2025-05-07T20:32:29.5784818Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5785833Z context = 2025-05-07T20:32:29.5786117Z 2025-05-07T20:32:29.5786278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5786786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5787235Z module_map=module_map) 2025-05-07T20:32:29.5787582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5787912Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5788150Z E ^ 2025-05-07T20:32:29.5788601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5789040Z 2025-05-07T20:32:29.5789456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5790004Z 2025-05-07T20:32:29.5790099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5790582Z self=, 2025-05-07T20:32:29.5790967Z T=1, 2025-05-07T20:32:29.5791133Z D=5120, 2025-05-07T20:32:29.5791316Z scale_ub=None, 2025-05-07T20:32:29.5791516Z contiguous=False, 2025-05-07T20:32:29.5791724Z compiled=False, 2025-05-07T20:32:29.5791914Z ) 2025-05-07T20:32:29.5792217Z self = 2025-05-07T20:32:29.5792679Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5792933Z 2025-05-07T20:32:29.5793001Z @given( 2025-05-07T20:32:29.5793218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5793520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5793808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5794120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5794437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5794706Z ) 2025-05-07T20:32:29.5795034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5795458Z def test_silu_mul_quant( 2025-05-07T20:32:29.5795679Z self, 2025-05-07T20:32:29.5795859Z T: int, 2025-05-07T20:32:29.5796043Z D: int, 2025-05-07T20:32:29.5796244Z scale_ub: Optional[float], 2025-05-07T20:32:29.5796501Z contiguous: bool, 2025-05-07T20:32:29.5796722Z compiled: bool, 2025-05-07T20:32:29.5796930Z ) -> None: 2025-05-07T20:32:29.5797127Z torch.manual_seed(2025) 2025-05-07T20:32:29.5797355Z 2025-05-07T20:32:29.5797694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5798021Z 2025-05-07T20:32:29.5798199Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5798473Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5798760Z x = x_sign * x_clamp 2025-05-07T20:32:29.5798987Z x0 = x[:, :D] 2025-05-07T20:32:29.5799191Z x1 = x[:, D:] 2025-05-07T20:32:29.5799379Z 2025-05-07T20:32:29.5799555Z if contiguous: 2025-05-07T20:32:29.5799768Z x0 = x0.contiguous() 2025-05-07T20:32:29.5800004Z x1 = x1.contiguous() 2025-05-07T20:32:29.5800230Z 2025-05-07T20:32:29.5800401Z if scale_ub is not None: 2025-05-07T20:32:29.5800648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5800964Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5801259Z ) 2025-05-07T20:32:29.5801431Z else: 2025-05-07T20:32:29.5801625Z scale_ub_tensor = None 2025-05-07T20:32:29.5801858Z 2025-05-07T20:32:29.5802080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5802373Z op = silu_mul_quant 2025-05-07T20:32:29.5802606Z if compiled: 2025-05-07T20:32:29.5802838Z op = torch.compile(op) 2025-05-07T20:32:29.5803118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5803376Z 2025-05-07T20:32:29.5803554Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5803710Z 2025-05-07T20:32:29.5803801Z moe/activation_test.py:117: 2025-05-07T20:32:29.5804080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5804513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5804772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5805439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5806106Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5806639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5807297Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5807945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5808715Z kernel = self.compile( 2025-05-07T20:32:29.5809239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5809871Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5810251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5810472Z 2025-05-07T20:32:29.5810679Z self = 2025-05-07T20:32:29.5811741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5813094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0aaaa20>} 2025-05-07T20:32:29.5814420Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5815420Z context = 2025-05-07T20:32:29.5815699Z 2025-05-07T20:32:29.5815861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5816359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5816991Z module_map=module_map) 2025-05-07T20:32:29.5817341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5817674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5823893Z E ^ 2025-05-07T20:32:29.5824378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5824839Z 2025-05-07T20:32:29.5825260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5825779Z 2025-05-07T20:32:29.5825884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5826293Z self=, 2025-05-07T20:32:29.5826704Z T=4096, 2025-05-07T20:32:29.5826891Z D=7168, 2025-05-07T20:32:29.5827083Z scale_ub=1200.0, 2025-05-07T20:32:29.5827314Z contiguous=False, 2025-05-07T20:32:29.5827536Z compiled=False, 2025-05-07T20:32:29.5827747Z ) 2025-05-07T20:32:29.5828074Z self = 2025-05-07T20:32:29.5828563Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5828840Z 2025-05-07T20:32:29.5828918Z @given( 2025-05-07T20:32:29.5829149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5829466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5829775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5830107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5830433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5830713Z ) 2025-05-07T20:32:29.5831060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5831502Z def test_silu_mul_quant( 2025-05-07T20:32:29.5831737Z self, 2025-05-07T20:32:29.5831930Z T: int, 2025-05-07T20:32:29.5832132Z D: int, 2025-05-07T20:32:29.5832349Z scale_ub: Optional[float], 2025-05-07T20:32:29.5832623Z contiguous: bool, 2025-05-07T20:32:29.5832863Z compiled: bool, 2025-05-07T20:32:29.5833079Z ) -> None: 2025-05-07T20:32:29.5833294Z torch.manual_seed(2025) 2025-05-07T20:32:29.5833536Z 2025-05-07T20:32:29.5833971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5834315Z 2025-05-07T20:32:29.5834510Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5834807Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5835108Z x = x_sign * x_clamp 2025-05-07T20:32:29.5835342Z x0 = x[:, :D] 2025-05-07T20:32:29.5835553Z x1 = x[:, D:] 2025-05-07T20:32:29.5835755Z 2025-05-07T20:32:29.5835954Z if contiguous: 2025-05-07T20:32:29.5836202Z x0 = x0.contiguous() 2025-05-07T20:32:29.5836463Z x1 = x1.contiguous() 2025-05-07T20:32:29.5836709Z 2025-05-07T20:32:29.5836901Z if scale_ub is not None: 2025-05-07T20:32:29.5837177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5837510Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5837816Z ) 2025-05-07T20:32:29.5838009Z else: 2025-05-07T20:32:29.5838218Z scale_ub_tensor = None 2025-05-07T20:32:29.5838478Z 2025-05-07T20:32:29.5838720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5839033Z op = silu_mul_quant 2025-05-07T20:32:29.5839288Z if compiled: 2025-05-07T20:32:29.5839542Z op = torch.compile(op) 2025-05-07T20:32:29.5839834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5840107Z 2025-05-07T20:32:29.5840300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5840466Z 2025-05-07T20:32:29.5840568Z moe/activation_test.py:117: 2025-05-07T20:32:29.5840866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5841291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5841575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5842253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.5842944Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5843482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5844155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5844931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5845453Z kernel = self.compile( 2025-05-07T20:32:29.5845991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5846641Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5847034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5847270Z 2025-05-07T20:32:29.5847475Z self = 2025-05-07T20:32:29.5848549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5849926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e442c0>} 2025-05-07T20:32:29.5851257Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5852275Z context = 2025-05-07T20:32:29.5852565Z 2025-05-07T20:32:29.5852728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5853244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5853788Z module_map=module_map) 2025-05-07T20:32:29.5854144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5854490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5854749Z E ^ 2025-05-07T20:32:29.5855207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5855657Z 2025-05-07T20:32:29.5856069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7440159Z 2025-05-07T20:32:29.7440707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7441573Z self=, 2025-05-07T20:32:29.7442406Z T=16384, 2025-05-07T20:32:29.7442779Z D=7168, 2025-05-07T20:32:29.7443149Z scale_ub=None, 2025-05-07T20:32:29.7443576Z contiguous=True, 2025-05-07T20:32:29.7443796Z compiled=True, 2025-05-07T20:32:29.7444010Z ) 2025-05-07T20:32:29.7444433Z self = 2025-05-07T20:32:29.7444950Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.7445234Z 2025-05-07T20:32:29.7445314Z @given( 2025-05-07T20:32:29.7445549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.7445872Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.7446180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.7446521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.7446864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.7447154Z ) 2025-05-07T20:32:29.7447680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.7448198Z def test_silu_mul_quant( 2025-05-07T20:32:29.7448449Z self, 2025-05-07T20:32:29.7448641Z T: int, 2025-05-07T20:32:29.7448846Z D: int, 2025-05-07T20:32:29.7449064Z scale_ub: Optional[float], 2025-05-07T20:32:29.7449325Z contiguous: bool, 2025-05-07T20:32:29.7449561Z compiled: bool, 2025-05-07T20:32:29.7449787Z ) -> None: 2025-05-07T20:32:29.7449995Z torch.manual_seed(2025) 2025-05-07T20:32:29.7450238Z 2025-05-07T20:32:29.7450504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.7450840Z 2025-05-07T20:32:29.7451023Z x_sign = torch.sign(x) 2025-05-07T20:32:29.7451314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.7451613Z x = x_sign * x_clamp 2025-05-07T20:32:29.7451847Z x0 = x[:, :D] 2025-05-07T20:32:29.7452061Z x1 = x[:, D:] 2025-05-07T20:32:29.7452260Z 2025-05-07T20:32:29.7452431Z if contiguous: 2025-05-07T20:32:29.7452669Z x0 = x0.contiguous() 2025-05-07T20:32:29.7452915Z x1 = x1.contiguous() 2025-05-07T20:32:29.7453152Z 2025-05-07T20:32:29.7453338Z if scale_ub is not None: 2025-05-07T20:32:29.7453610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.7453936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.7454242Z ) 2025-05-07T20:32:29.7454438Z else: 2025-05-07T20:32:29.7454636Z scale_ub_tensor = None 2025-05-07T20:32:29.7454889Z 2025-05-07T20:32:29.7455106Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.7455413Z op = silu_mul_quant 2025-05-07T20:32:29.7455658Z if compiled: 2025-05-07T20:32:29.7455903Z op = torch.compile(op) 2025-05-07T20:32:29.7456195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7456467Z 2025-05-07T20:32:29.7456655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.7456813Z 2025-05-07T20:32:29.7456911Z moe/activation_test.py:117: 2025-05-07T20:32:29.7457207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7457676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.7457958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7458511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.7459065Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.7459714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.7460383Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.7460913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.7461593Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.7462250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.7462768Z kernel = self.compile( 2025-05-07T20:32:29.7463314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.7463967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.7464396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7464647Z 2025-05-07T20:32:29.7464852Z self = 2025-05-07T20:32:29.7466013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.7467381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e45c60>} 2025-05-07T20:32:29.7468718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.7469742Z context = 2025-05-07T20:32:29.7470033Z 2025-05-07T20:32:29.7470200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.7470720Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.7471180Z module_map=module_map) 2025-05-07T20:32:29.7471535Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.7471884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.7472138Z E ^ 2025-05-07T20:32:29.7472595Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7473045Z 2025-05-07T20:32:29.7473459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7473973Z 2025-05-07T20:32:29.7474069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7474477Z self=, 2025-05-07T20:32:29.7474863Z T=4096, 2025-05-07T20:32:29.7475049Z D=5120, 2025-05-07T20:32:29.7475234Z scale_ub=None, 2025-05-07T20:32:29.7475440Z contiguous=False, 2025-05-07T20:32:29.7475662Z compiled=True, 2025-05-07T20:32:29.7475864Z ) 2025-05-07T20:32:29.7476168Z self = 2025-05-07T20:32:29.7476664Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.7476937Z 2025-05-07T20:32:29.7477014Z @given( 2025-05-07T20:32:29.7477236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.7477533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.7477918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.7478241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.7478557Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.7478834Z ) 2025-05-07T20:32:29.7479172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.7479598Z def test_silu_mul_quant( 2025-05-07T20:32:29.7479837Z self, 2025-05-07T20:32:29.7480033Z T: int, 2025-05-07T20:32:29.7480225Z D: int, 2025-05-07T20:32:29.7480432Z scale_ub: Optional[float], 2025-05-07T20:32:29.7480700Z contiguous: bool, 2025-05-07T20:32:29.7480937Z compiled: bool, 2025-05-07T20:32:29.7481152Z ) -> None: 2025-05-07T20:32:29.7481367Z torch.manual_seed(2025) 2025-05-07T20:32:29.7481607Z 2025-05-07T20:32:29.7481873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.7482212Z 2025-05-07T20:32:29.7482412Z x_sign = torch.sign(x) 2025-05-07T20:32:29.7482693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.7483004Z x = x_sign * x_clamp 2025-05-07T20:32:29.7483246Z x0 = x[:, :D] 2025-05-07T20:32:29.7483454Z x1 = x[:, D:] 2025-05-07T20:32:29.7483664Z 2025-05-07T20:32:29.7483844Z if contiguous: 2025-05-07T20:32:29.7484069Z x0 = x0.contiguous() 2025-05-07T20:32:29.7484419Z x1 = x1.contiguous() 2025-05-07T20:32:29.7484666Z 2025-05-07T20:32:29.7484850Z if scale_ub is not None: 2025-05-07T20:32:29.7485112Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.7485553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.7485855Z ) 2025-05-07T20:32:29.7486040Z else: 2025-05-07T20:32:29.7486240Z scale_ub_tensor = None 2025-05-07T20:32:29.7486489Z 2025-05-07T20:32:29.7486711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.7487019Z op = silu_mul_quant 2025-05-07T20:32:29.7487264Z if compiled: 2025-05-07T20:32:29.7487503Z op = torch.compile(op) 2025-05-07T20:32:29.7487794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7488067Z 2025-05-07T20:32:29.7488250Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.7488422Z 2025-05-07T20:32:29.7488516Z moe/activation_test.py:117: 2025-05-07T20:32:29.7488805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7489131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.7489408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.7489964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.7490518Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.7491160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.7491837Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.7492366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.7493039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.7493688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.7494210Z kernel = self.compile( 2025-05-07T20:32:29.7494742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.7495386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.7495782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.7496009Z 2025-05-07T20:32:29.7496211Z self = 2025-05-07T20:32:29.7497366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.7498721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e46980>} 2025-05-07T20:32:29.7500049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.7501062Z context = 2025-05-07T20:32:29.7501347Z 2025-05-07T20:32:29.7501517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.7502024Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.7502483Z module_map=module_map) 2025-05-07T20:32:29.7502842Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.7503185Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.7503435Z E ^ 2025-05-07T20:32:29.7503895Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7504338Z 2025-05-07T20:32:29.7504753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
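The failure just above, and every retry summarized below, is the same architecture mismatch rather than a problem with the drawn inputs: Triton's fp8e4nv is the e4m3 float8 format, which Triton only compiles natively for NVIDIA GPUs of compute capability 8.9 (Ada/Hopper) and newer, while the A10G backing a linux.g5.4xlarge runner reports sm_86 and exposes only fp8e4b15 and fp8e5, exactly what the ValueError lists. A capability guard along the following lines would let the suite skip cleanly on such runners; this is a minimal sketch, and supports_fp8e4nv/skip_if_no_fp8 are made-up names here, not FBGEMM test utilities:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) needs an NVIDIA GPU with compute capability >= 8.9;
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied as @skip_if_no_fp8 on test_silu_mul_quant, the test would be
    # reported once as skipped instead of failing on every Hypothesis example.
    skip_if_no_fp8 = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )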
2025-05-07T20:32:29.8931563Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.8932088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.8932756Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.8933520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.8934041Z kernel = self.compile( 2025-05-07T20:32:29.8934575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.8935227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.8935619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8935843Z 2025-05-07T20:32:29.8936055Z self = 2025-05-07T20:32:29.8937126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.8938486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e47ba0>} 2025-05-07T20:32:29.8939811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.8940832Z context = 2025-05-07T20:32:29.8941114Z 2025-05-07T20:32:29.8941280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.8941788Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.8942244Z module_map=module_map) 2025-05-07T20:32:29.8942603Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.8942947Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.8943197Z E ^ 2025-05-07T20:32:29.8943656Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.8944094Z 2025-05-07T20:32:29.8944506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.8945093Z 2025-05-07T20:32:29.8945198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.8945601Z self=, 2025-05-07T20:32:29.8945998Z T=4096, 2025-05-07T20:32:29.8946179Z D=5120, 2025-05-07T20:32:29.8946362Z scale_ub=1200.0, 2025-05-07T20:32:29.8946585Z contiguous=False, 2025-05-07T20:32:29.8946802Z compiled=True, 2025-05-07T20:32:29.8946991Z ) 2025-05-07T20:32:29.8947302Z self = 2025-05-07T20:32:29.8947793Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.8948058Z 2025-05-07T20:32:29.8948131Z @given( 2025-05-07T20:32:29.8948358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.8948667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.8948973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.8949289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.8949619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.8949900Z ) 2025-05-07T20:32:29.8950237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.8950672Z def test_silu_mul_quant( 2025-05-07T20:32:29.8950913Z self, 2025-05-07T20:32:29.8951101Z T: int, 2025-05-07T20:32:29.8951294Z D: int, 2025-05-07T20:32:29.8951506Z scale_ub: Optional[float], 2025-05-07T20:32:29.8951765Z contiguous: bool, 2025-05-07T20:32:29.8951996Z compiled: bool, 2025-05-07T20:32:29.8952215Z ) -> None: 2025-05-07T20:32:29.8952427Z torch.manual_seed(2025) 2025-05-07T20:32:29.8952755Z 2025-05-07T20:32:29.8953021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.8953356Z 2025-05-07T20:32:29.8953544Z x_sign = torch.sign(x) 2025-05-07T20:32:29.8953826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.8954135Z x = x_sign * x_clamp 2025-05-07T20:32:29.8954361Z x0 = x[:, :D] 2025-05-07T20:32:29.8954567Z x1 = x[:, D:] 2025-05-07T20:32:29.8954775Z 2025-05-07T20:32:29.8954947Z if contiguous: 2025-05-07T20:32:29.8955175Z x0 = x0.contiguous() 2025-05-07T20:32:29.8955426Z x1 = x1.contiguous() 2025-05-07T20:32:29.8955652Z 2025-05-07T20:32:29.8955844Z if scale_ub is not None: 2025-05-07T20:32:29.8956106Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.8956428Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.8956742Z ) 2025-05-07T20:32:29.8956938Z else: 2025-05-07T20:32:29.8957154Z scale_ub_tensor = None 2025-05-07T20:32:29.8957406Z 2025-05-07T20:32:29.8963876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.8964301Z op = silu_mul_quant 2025-05-07T20:32:29.8964603Z if compiled: 2025-05-07T20:32:29.8964874Z op = torch.compile(op) 2025-05-07T20:32:29.8965166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.8965440Z 2025-05-07T20:32:29.8965633Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.8965798Z 2025-05-07T20:32:29.8965899Z moe/activation_test.py:117: 2025-05-07T20:32:29.8966201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8966534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.8966815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.8967376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.8967936Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.8968592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.8969266Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.8969964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.8970694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.8971353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.8971880Z kernel = self.compile( 2025-05-07T20:32:29.8972423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.8973073Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.8973473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.8973705Z 2025-05-07T20:32:29.8973909Z self = 2025-05-07T20:32:29.8974981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.8976363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ce8ea0>} 2025-05-07T20:32:29.8977696Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.8978712Z context = 2025-05-07T20:32:29.8979004Z 2025-05-07T20:32:29.8979247Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.8979762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.8980223Z module_map=module_map) 2025-05-07T20:32:29.8980594Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.8980942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.8981202Z E ^ 2025-05-07T20:32:29.8981661Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.8982119Z 2025-05-07T20:32:29.8982536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.8983043Z 2025-05-07T20:32:29.8983146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.8983549Z self=, 2025-05-07T20:32:29.8983952Z T=2048, 2025-05-07T20:32:29.8984145Z D=7168, 2025-05-07T20:32:29.8984339Z scale_ub=1200.0, 2025-05-07T20:32:29.8984563Z contiguous=False, 2025-05-07T20:32:29.8984793Z compiled=False, 2025-05-07T20:32:30.0958777Z ) 2025-05-07T20:32:30.0959736Z self = 2025-05-07T20:32:30.0961290Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.0961891Z 2025-05-07T20:32:30.0962056Z @given( 2025-05-07T20:32:30.0962496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0962902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0963219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0963557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0963906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0964330Z ) 2025-05-07T20:32:30.0964700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0965165Z def test_silu_mul_quant( 2025-05-07T20:32:30.0965407Z self, 2025-05-07T20:32:30.0965602Z T: int, 2025-05-07T20:32:30.0965800Z D: int, 2025-05-07T20:32:30.0966012Z scale_ub: Optional[float], 2025-05-07T20:32:30.0966483Z contiguous: bool, 2025-05-07T20:32:30.0966725Z compiled: bool, 2025-05-07T20:32:30.0966951Z ) -> None: 2025-05-07T20:32:30.0967172Z torch.manual_seed(2025) 2025-05-07T20:32:30.0967419Z 2025-05-07T20:32:30.0967690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0968043Z 2025-05-07T20:32:30.0968235Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0968527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0968845Z x = x_sign * x_clamp 2025-05-07T20:32:30.0969082Z x0 = x[:, :D] 2025-05-07T20:32:30.0969281Z x1 = x[:, D:] 2025-05-07T20:32:30.0969487Z 2025-05-07T20:32:30.0969664Z if contiguous: 2025-05-07T20:32:30.0969892Z x0 = x0.contiguous() 2025-05-07T20:32:30.0970135Z x1 = x1.contiguous() 2025-05-07T20:32:30.0970365Z 2025-05-07T20:32:30.0970553Z if scale_ub is not None: 2025-05-07T20:32:30.0970819Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.0971144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.0971449Z ) 2025-05-07T20:32:30.0971628Z else: 2025-05-07T20:32:30.0971828Z scale_ub_tensor = None 2025-05-07T20:32:30.0972072Z 2025-05-07T20:32:30.0972288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.0972597Z op = silu_mul_quant 2025-05-07T20:32:30.0972839Z if compiled: 2025-05-07T20:32:30.0973072Z op = torch.compile(op) 2025-05-07T20:32:30.0973360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0973630Z 2025-05-07T20:32:30.0973939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.0974105Z 2025-05-07T20:32:30.0974196Z moe/activation_test.py:117: 2025-05-07T20:32:30.0974485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0974810Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.0975082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.0975759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:30.0976437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.0976956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.0977622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.0978274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.0978800Z kernel = self.compile( 2025-05-07T20:32:30.0979325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.0979970Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0980368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.0980586Z 2025-05-07T20:32:30.0980800Z self = 2025-05-07T20:32:30.0981867Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.0983235Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ce9940>} 2025-05-07T20:32:30.0984585Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.0985609Z context = 2025-05-07T20:32:30.0985988Z 2025-05-07T20:32:30.0986149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.0986667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0987128Z module_map=module_map) 2025-05-07T20:32:30.0987492Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0987836Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.0988103Z E ^ 2025-05-07T20:32:30.0988568Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0989012Z 2025-05-07T20:32:30.0989433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.0990100Z 2025-05-07T20:32:30.0990202Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.0990610Z self=, 2025-05-07T20:32:30.0991016Z T=1, 2025-05-07T20:32:30.0991190Z D=7168, 2025-05-07T20:32:30.0991384Z scale_ub=None, 2025-05-07T20:32:30.0991592Z contiguous=True, 2025-05-07T20:32:30.0991805Z compiled=False, 2025-05-07T20:32:30.0992004Z ) 2025-05-07T20:32:30.0992321Z self = 2025-05-07T20:32:30.0992795Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:30.0993059Z 2025-05-07T20:32:30.0993134Z @given( 2025-05-07T20:32:30.0993363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.0993768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.0994071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.0994391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.0994723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.0994998Z ) 2025-05-07T20:32:30.0995353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.0995796Z def test_silu_mul_quant( 2025-05-07T20:32:30.0996029Z self, 2025-05-07T20:32:30.0996222Z T: int, 2025-05-07T20:32:30.0996424Z D: int, 2025-05-07T20:32:30.0996633Z scale_ub: Optional[float], 2025-05-07T20:32:30.0996903Z contiguous: bool, 2025-05-07T20:32:30.0997135Z compiled: bool, 2025-05-07T20:32:30.0997344Z ) -> None: 2025-05-07T20:32:30.0997555Z torch.manual_seed(2025) 2025-05-07T20:32:30.0997792Z 2025-05-07T20:32:30.0998060Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.0998396Z 2025-05-07T20:32:30.0998580Z x_sign = torch.sign(x) 2025-05-07T20:32:30.0998863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.0999151Z x = x_sign * x_clamp 2025-05-07T20:32:30.0999382Z x0 = x[:, :D] 2025-05-07T20:32:30.0999599Z x1 = x[:, D:] 2025-05-07T20:32:30.0999791Z 2025-05-07T20:32:30.0999957Z if contiguous: 2025-05-07T20:32:30.1000175Z x0 = x0.contiguous() 2025-05-07T20:32:30.1000411Z x1 = x1.contiguous() 2025-05-07T20:32:30.1000634Z 2025-05-07T20:32:30.1000813Z if scale_ub is not None: 2025-05-07T20:32:30.1001066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1001391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1001684Z ) 2025-05-07T20:32:30.1001863Z else: 2025-05-07T20:32:30.1002068Z scale_ub_tensor = None 2025-05-07T20:32:30.1002314Z 2025-05-07T20:32:30.1002540Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1002841Z op = silu_mul_quant 2025-05-07T20:32:30.1003081Z if compiled: 2025-05-07T20:32:30.1003321Z op = torch.compile(op) 2025-05-07T20:32:30.1003600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1003957Z 2025-05-07T20:32:30.1004142Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1004393Z 2025-05-07T20:32:30.1004486Z moe/activation_test.py:117: 2025-05-07T20:32:30.1004773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1005100Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1005363Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1006048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1006721Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1007255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1007917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1008734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1009262Z kernel = self.compile( 2025-05-07T20:32:30.1009798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1010438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1010826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1011048Z 2025-05-07T20:32:30.1011261Z self = 2025-05-07T20:32:30.1012481Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1013855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2ceaca0>} 2025-05-07T20:32:30.1015204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1016228Z context = 2025-05-07T20:32:30.1016512Z 2025-05-07T20:32:30.1016684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1017190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1017665Z module_map=module_map) 2025-05-07T20:32:30.1018027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1018367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1018628Z E ^ 2025-05-07T20:32:30.1019089Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1019542Z 2025-05-07T20:32:30.1019961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.1020465Z 2025-05-07T20:32:30.1020564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.1020984Z self=, 2025-05-07T20:32:30.1021386Z T=16384, 2025-05-07T20:32:30.1021566Z D=7168, 2025-05-07T20:32:30.1021744Z scale_ub=1200.0, 2025-05-07T20:32:30.1021966Z contiguous=False, 2025-05-07T20:32:30.1022193Z compiled=True, 2025-05-07T20:32:30.1022409Z ) 2025-05-07T20:32:30.1022719Z self = 2025-05-07T20:32:30.1023221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:30.1023499Z 2025-05-07T20:32:30.1023581Z @given( 2025-05-07T20:32:30.1023801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.1024234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.1024534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.1024855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.1025177Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.1025458Z ) 2025-05-07T20:32:30.1025798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.1026239Z def test_silu_mul_quant( 2025-05-07T20:32:30.1026483Z self, 2025-05-07T20:32:30.1026674Z T: int, 2025-05-07T20:32:30.1026855Z D: int, 2025-05-07T20:32:30.1027066Z scale_ub: Optional[float], 2025-05-07T20:32:30.1027346Z contiguous: bool, 2025-05-07T20:32:30.1027568Z compiled: bool, 2025-05-07T20:32:30.1027788Z ) -> None: 2025-05-07T20:32:30.1027995Z torch.manual_seed(2025) 2025-05-07T20:32:30.1028223Z 2025-05-07T20:32:30.1028487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.1028835Z 2025-05-07T20:32:30.1029013Z x_sign = torch.sign(x) 2025-05-07T20:32:30.1029303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.1029609Z x = x_sign * x_clamp 2025-05-07T20:32:30.1029835Z x0 = x[:, :D] 2025-05-07T20:32:30.1030041Z x1 = x[:, D:] 2025-05-07T20:32:30.1030246Z 2025-05-07T20:32:30.1030413Z if contiguous: 2025-05-07T20:32:30.1030635Z x0 = x0.contiguous() 2025-05-07T20:32:30.1030898Z x1 = x1.contiguous() 2025-05-07T20:32:30.1031129Z 2025-05-07T20:32:30.1031323Z if scale_ub is not None: 2025-05-07T20:32:30.1031679Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.1032003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.1032298Z ) 2025-05-07T20:32:30.1032474Z else: 2025-05-07T20:32:30.1032670Z scale_ub_tensor = None 2025-05-07T20:32:30.1032915Z 2025-05-07T20:32:30.1033141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.1033438Z op = silu_mul_quant 2025-05-07T20:32:30.1033675Z if compiled: 2025-05-07T20:32:30.1033917Z op = torch.compile(op) 2025-05-07T20:32:30.1034206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1034462Z 2025-05-07T20:32:30.1034646Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.1034805Z 2025-05-07T20:32:30.1034906Z moe/activation_test.py:117: 2025-05-07T20:32:30.1035191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1035515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.1035794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.1036334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.1036880Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.1037524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.1038202Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.1038720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.1039391Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.1040042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.1040555Z kernel = self.compile( 2025-05-07T20:32:30.1041089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.1041737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.1042119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.1042342Z 2025-05-07T20:32:30.1042629Z self = 2025-05-07T20:32:30.1043698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.1045117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2cebf60>} 2025-05-07T20:32:30.1046450Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.1047460Z context = 2025-05-07T20:32:30.1047747Z 2025-05-07T20:32:30.1047906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.1048422Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.1048883Z module_map=module_map) 2025-05-07T20:32:30.1049230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.1049576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.1049821Z E ^ 2025-05-07T20:32:30.1050273Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.1050716Z 2025-05-07T20:32:30.1051206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2378674Z 2025-05-07T20:32:30.2378931Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2379694Z self=, 2025-05-07T20:32:30.2380404Z T=1, 2025-05-07T20:32:30.2380747Z D=7168, 2025-05-07T20:32:30.2381079Z scale_ub=None, 2025-05-07T20:32:30.2381309Z contiguous=False, 2025-05-07T20:32:30.2381532Z compiled=False, 2025-05-07T20:32:30.2381742Z ) 2025-05-07T20:32:30.2382058Z self = 2025-05-07T20:32:30.2382545Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.2382801Z 2025-05-07T20:32:30.2382877Z @given( 2025-05-07T20:32:30.2383099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2383409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2383715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2384057Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2384398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2384696Z ) 2025-05-07T20:32:30.2385056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2385515Z def test_silu_mul_quant( 2025-05-07T20:32:30.2385763Z self, 2025-05-07T20:32:30.2385966Z T: int, 2025-05-07T20:32:30.2386165Z D: int, 2025-05-07T20:32:30.2386386Z scale_ub: Optional[float], 2025-05-07T20:32:30.2386663Z contiguous: bool, 2025-05-07T20:32:30.2386902Z compiled: bool, 2025-05-07T20:32:30.2387129Z ) -> None: 2025-05-07T20:32:30.2387354Z torch.manual_seed(2025) 2025-05-07T20:32:30.2387596Z 2025-05-07T20:32:30.2387873Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2388228Z 2025-05-07T20:32:30.2388421Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2388721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2389040Z x = x_sign * x_clamp 2025-05-07T20:32:30.2389276Z x0 = x[:, :D] 2025-05-07T20:32:30.2389499Z x1 = x[:, D:] 2025-05-07T20:32:30.2389710Z 2025-05-07T20:32:30.2389891Z if contiguous: 2025-05-07T20:32:30.2390312Z x0 = x0.contiguous() 2025-05-07T20:32:30.2390576Z x1 = x1.contiguous() 2025-05-07T20:32:30.2390822Z 2025-05-07T20:32:30.2391010Z if scale_ub is not None: 2025-05-07T20:32:30.2391286Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2391628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2391926Z ) 2025-05-07T20:32:30.2392116Z else: 2025-05-07T20:32:30.2392325Z scale_ub_tensor = None 2025-05-07T20:32:30.2392566Z 2025-05-07T20:32:30.2392788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2393098Z op = silu_mul_quant 2025-05-07T20:32:30.2393343Z if compiled: 2025-05-07T20:32:30.2393587Z op = torch.compile(op) 2025-05-07T20:32:30.2393878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2394138Z 2025-05-07T20:32:30.2394326Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2394487Z 2025-05-07T20:32:30.2394598Z moe/activation_test.py:117: 2025-05-07T20:32:30.2394885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2395211Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2395489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2396172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2396848Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2397377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2398171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2398829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2399346Z kernel = self.compile( 2025-05-07T20:32:30.2399878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2400531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2400916Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2401145Z 2025-05-07T20:32:30.2401349Z self = 2025-05-07T20:32:30.2402418Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2403782Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303c9a0>} 2025-05-07T20:32:30.2405216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2406233Z context = 2025-05-07T20:32:30.2406521Z 2025-05-07T20:32:30.2406677Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2407182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2407635Z module_map=module_map) 2025-05-07T20:32:30.2407981Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2408547Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2408796Z E ^ 2025-05-07T20:32:30.2409250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2409696Z 2025-05-07T20:32:30.2410105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2410778Z 2025-05-07T20:32:30.2410878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2411270Z self=, 2025-05-07T20:32:30.2411660Z T=2048, 2025-05-07T20:32:30.2411840Z D=7168, 2025-05-07T20:32:30.2418974Z scale_ub=None, 2025-05-07T20:32:30.2419228Z contiguous=False, 2025-05-07T20:32:30.2419460Z compiled=True, 2025-05-07T20:32:30.2419671Z ) 2025-05-07T20:32:30.2419990Z self = 2025-05-07T20:32:30.2420488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.2420765Z 2025-05-07T20:32:30.2420852Z @given( 2025-05-07T20:32:30.2421081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.2421402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.2421708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.2422044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.2422368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.2422647Z ) 2025-05-07T20:32:30.2422992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.2423421Z def test_silu_mul_quant( 2025-05-07T20:32:30.2423660Z self, 2025-05-07T20:32:30.2423854Z T: int, 2025-05-07T20:32:30.2424041Z D: int, 2025-05-07T20:32:30.2424261Z scale_ub: Optional[float], 2025-05-07T20:32:30.2424527Z contiguous: bool, 2025-05-07T20:32:30.2424762Z compiled: bool, 2025-05-07T20:32:30.2424982Z ) -> None: 2025-05-07T20:32:30.2425353Z torch.manual_seed(2025) 2025-05-07T20:32:30.2425590Z 2025-05-07T20:32:30.2425859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.2426196Z 2025-05-07T20:32:30.2426389Z x_sign = torch.sign(x) 2025-05-07T20:32:30.2426675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.2426980Z x = x_sign * x_clamp 2025-05-07T20:32:30.2427214Z x0 = x[:, :D] 2025-05-07T20:32:30.2427421Z x1 = x[:, D:] 2025-05-07T20:32:30.2427626Z 2025-05-07T20:32:30.2427806Z if contiguous: 2025-05-07T20:32:30.2428037Z x0 = x0.contiguous() 2025-05-07T20:32:30.2428291Z x1 = x1.contiguous() 2025-05-07T20:32:30.2428526Z 2025-05-07T20:32:30.2428707Z if scale_ub is not None: 2025-05-07T20:32:30.2428974Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.2429306Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.2429609Z ) 2025-05-07T20:32:30.2429805Z else: 2025-05-07T20:32:30.2430025Z scale_ub_tensor = None 2025-05-07T20:32:30.2430268Z 2025-05-07T20:32:30.2430492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.2430795Z op = silu_mul_quant 2025-05-07T20:32:30.2431045Z if compiled: 2025-05-07T20:32:30.2431287Z op = torch.compile(op) 2025-05-07T20:32:30.2431581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2431848Z 2025-05-07T20:32:30.2432034Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.2432206Z 2025-05-07T20:32:30.2432307Z moe/activation_test.py:117: 2025-05-07T20:32:30.2432600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2432922Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.2433200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.2433758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.2434302Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.2434951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.2435709Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.2436243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.2436916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.2437572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.2438093Z kernel = self.compile( 2025-05-07T20:32:30.2438639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.2439281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.2439685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.2439910Z 2025-05-07T20:32:30.2440120Z self = 2025-05-07T20:32:30.2441203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.2442573Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303e160>} 2025-05-07T20:32:30.2443899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.2445196Z context = 2025-05-07T20:32:30.2445496Z 2025-05-07T20:32:30.2445661Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.2446179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.2446638Z module_map=module_map) 2025-05-07T20:32:30.2447001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.2447346Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.2447590Z E ^ 2025-05-07T20:32:30.2448057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.2448503Z 2025-05-07T20:32:30.2448914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.2449419Z 2025-05-07T20:32:30.2449520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.2449934Z self=, 2025-05-07T20:32:30.2450331Z T=4096, 2025-05-07T20:32:30.2450519Z D=7168, 2025-05-07T20:32:30.2450697Z scale_ub=None, 2025-05-07T20:32:30.2450908Z contiguous=False, 2025-05-07T20:32:30.2451135Z compiled=True, 2025-05-07T20:32:30.4738039Z ) 2025-05-07T20:32:30.4739141Z self = 2025-05-07T20:32:30.4741043Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.4742070Z 2025-05-07T20:32:30.4742279Z @given( 2025-05-07T20:32:30.4742724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4743330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4743913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4744547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4745184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4745739Z ) 2025-05-07T20:32:30.4746417Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4747281Z def test_silu_mul_quant( 2025-05-07T20:32:30.4747742Z self, 2025-05-07T20:32:30.4748098Z T: int, 2025-05-07T20:32:30.4748797Z D: int, 2025-05-07T20:32:30.4749207Z scale_ub: Optional[float], 2025-05-07T20:32:30.4749719Z contiguous: bool, 2025-05-07T20:32:30.4750172Z compiled: bool, 2025-05-07T20:32:30.4750598Z ) -> None: 2025-05-07T20:32:30.4750998Z torch.manual_seed(2025) 2025-05-07T20:32:30.4751459Z 2025-05-07T20:32:30.4751976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4752631Z 2025-05-07T20:32:30.4752990Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4753544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4754135Z x = x_sign * x_clamp 2025-05-07T20:32:30.4754459Z x0 = x[:, :D] 2025-05-07T20:32:30.4754706Z x1 = x[:, D:] 2025-05-07T20:32:30.4754903Z 2025-05-07T20:32:30.4755077Z if contiguous: 2025-05-07T20:32:30.4755299Z x0 = x0.contiguous() 2025-05-07T20:32:30.4755545Z x1 = x1.contiguous() 2025-05-07T20:32:30.4755781Z 2025-05-07T20:32:30.4755961Z if scale_ub is not None: 2025-05-07T20:32:30.4756219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4756535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4756834Z ) 2025-05-07T20:32:30.4757016Z else: 2025-05-07T20:32:30.4757209Z scale_ub_tensor = None 2025-05-07T20:32:30.4757446Z 2025-05-07T20:32:30.4757671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4757970Z op = silu_mul_quant 2025-05-07T20:32:30.4758209Z if compiled: 2025-05-07T20:32:30.4758444Z op = torch.compile(op) 2025-05-07T20:32:30.4758848Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4759119Z 2025-05-07T20:32:30.4759300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4759461Z 2025-05-07T20:32:30.4759554Z moe/activation_test.py:117: 2025-05-07T20:32:30.4759840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4760174Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4760446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4760997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.4761552Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.4762193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.4762859Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.4763374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.4764045Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.4764787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.4765301Z kernel = self.compile( 2025-05-07T20:32:30.4765842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.4766484Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.4766863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4767097Z 2025-05-07T20:32:30.4767295Z self = 2025-05-07T20:32:30.4768367Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.4769723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad303ee80>} 2025-05-07T20:32:30.4771046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.4772134Z context = 2025-05-07T20:32:30.4772424Z 2025-05-07T20:32:30.4772584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.4773089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.4773546Z module_map=module_map) 2025-05-07T20:32:30.4773892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.4774232Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.4774478Z E ^ 2025-05-07T20:32:30.4774915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.4775362Z 2025-05-07T20:32:30.4775778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.4776282Z 2025-05-07T20:32:30.4776385Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.4776780Z self=, 2025-05-07T20:32:30.4777157Z T=16384, 2025-05-07T20:32:30.4777337Z D=5120, 2025-05-07T20:32:30.4777521Z scale_ub=1200.0, 2025-05-07T20:32:30.4777726Z contiguous=False, 2025-05-07T20:32:30.4777942Z compiled=False, 2025-05-07T20:32:30.4778136Z ) 2025-05-07T20:32:30.4778439Z self = 2025-05-07T20:32:30.4779006Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.4779286Z 2025-05-07T20:32:30.4779357Z @given( 2025-05-07T20:32:30.4779579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4779873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4780165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4780478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4780791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4781063Z ) 2025-05-07T20:32:30.4781397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4781815Z def test_silu_mul_quant( 2025-05-07T20:32:30.4782043Z self, 2025-05-07T20:32:30.4782225Z T: int, 2025-05-07T20:32:30.4782408Z D: int, 2025-05-07T20:32:30.4782619Z scale_ub: Optional[float], 2025-05-07T20:32:30.4782885Z contiguous: bool, 2025-05-07T20:32:30.4783125Z compiled: bool, 2025-05-07T20:32:30.4783332Z ) -> None: 2025-05-07T20:32:30.4783549Z torch.manual_seed(2025) 2025-05-07T20:32:30.4783773Z 2025-05-07T20:32:30.4784029Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4784367Z 2025-05-07T20:32:30.4784572Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4784869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4785168Z x = x_sign * x_clamp 2025-05-07T20:32:30.4785402Z x0 = x[:, :D] 2025-05-07T20:32:30.4785601Z x1 = x[:, D:] 2025-05-07T20:32:30.4785795Z 2025-05-07T20:32:30.4785970Z if contiguous: 2025-05-07T20:32:30.4786185Z x0 = x0.contiguous() 2025-05-07T20:32:30.4786437Z x1 = x1.contiguous() 2025-05-07T20:32:30.4786666Z 2025-05-07T20:32:30.4786842Z if scale_ub is not None: 2025-05-07T20:32:30.4787101Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4787430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4787727Z ) 2025-05-07T20:32:30.4787903Z else: 2025-05-07T20:32:30.4788100Z scale_ub_tensor = None 2025-05-07T20:32:30.4788339Z 2025-05-07T20:32:30.4788553Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4788974Z op = silu_mul_quant 2025-05-07T20:32:30.4789204Z if compiled: 2025-05-07T20:32:30.4789430Z op = torch.compile(op) 2025-05-07T20:32:30.4789715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4789972Z 2025-05-07T20:32:30.4790149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4790307Z 2025-05-07T20:32:30.4790398Z moe/activation_test.py:117: 2025-05-07T20:32:30.4790686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4791004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4791279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4791957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:30.4792626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.4793148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.4793820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.4794469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.4794990Z kernel = self.compile( 2025-05-07T20:32:30.4795518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.4796166Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.4796563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4796865Z 2025-05-07T20:32:30.4797067Z self = 2025-05-07T20:32:30.4798129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.4799492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3138220>} 2025-05-07T20:32:30.4800817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.4801826Z context = 2025-05-07T20:32:30.4802104Z 2025-05-07T20:32:30.4802264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.4802767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.4803223Z module_map=module_map) 2025-05-07T20:32:30.4803575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.4803920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.4805766Z E ^ 2025-05-07T20:32:30.4806216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.4806655Z 2025-05-07T20:32:30.4807067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.4807571Z 2025-05-07T20:32:30.4807668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.4808070Z self=, 2025-05-07T20:32:30.4808630Z T=16384, 2025-05-07T20:32:30.4808807Z D=5120, 2025-05-07T20:32:30.4808989Z scale_ub=1200.0, 2025-05-07T20:32:30.4809191Z contiguous=True, 2025-05-07T20:32:30.4809399Z compiled=True, 2025-05-07T20:32:30.4809597Z ) 2025-05-07T20:32:30.4809895Z self = 2025-05-07T20:32:30.4810514Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:30.4810787Z 2025-05-07T20:32:30.4810857Z @given( 2025-05-07T20:32:30.4811078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.4811381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.4811676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.4811989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.4812296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.4812569Z ) 2025-05-07T20:32:30.4812907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.4813325Z def test_silu_mul_quant( 2025-05-07T20:32:30.4813549Z self, 2025-05-07T20:32:30.4813731Z T: int, 2025-05-07T20:32:30.4813913Z D: int, 2025-05-07T20:32:30.4814117Z scale_ub: Optional[float], 2025-05-07T20:32:30.4814374Z contiguous: bool, 2025-05-07T20:32:30.4814594Z compiled: bool, 2025-05-07T20:32:30.4814798Z ) -> None: 2025-05-07T20:32:30.4814995Z torch.manual_seed(2025) 2025-05-07T20:32:30.4815218Z 2025-05-07T20:32:30.4815470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.4815793Z 2025-05-07T20:32:30.4815968Z x_sign = torch.sign(x) 2025-05-07T20:32:30.4816240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.4816532Z x = x_sign * x_clamp 2025-05-07T20:32:30.4816756Z x0 = x[:, :D] 2025-05-07T20:32:30.4816947Z x1 = x[:, D:] 2025-05-07T20:32:30.4817138Z 2025-05-07T20:32:30.4817421Z if contiguous: 2025-05-07T20:32:30.4817635Z x0 = x0.contiguous() 2025-05-07T20:32:30.4817877Z x1 = x1.contiguous() 2025-05-07T20:32:30.4818100Z 2025-05-07T20:32:30.4818268Z if scale_ub is not None: 2025-05-07T20:32:30.4818526Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.4818849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.4819136Z ) 2025-05-07T20:32:30.4819315Z else: 2025-05-07T20:32:30.4819507Z scale_ub_tensor = None 2025-05-07T20:32:30.4819741Z 2025-05-07T20:32:30.4819951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.4820244Z op = silu_mul_quant 2025-05-07T20:32:30.4820477Z if compiled: 2025-05-07T20:32:30.4820705Z op = torch.compile(op) 2025-05-07T20:32:30.4820981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4821237Z 2025-05-07T20:32:30.4821406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.4821573Z 2025-05-07T20:32:30.4821664Z moe/activation_test.py:117: 2025-05-07T20:32:30.4821949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.4822258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.4822529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.4823066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:30.4823598Z return fn(*args, **kwargs) 
2025-05-07T20:32:30.4824242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:30.4824902Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:30.4825419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:30.4826071Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:30.4826727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:30.4827238Z     kernel = self.compile(
2025-05-07T20:32:30.4827760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:30.4828478Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:30.4828862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.4829084Z 
2025-05-07T20:32:30.4829289Z self = 
2025-05-07T20:32:30.4830352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:30.4831701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad31394e0>}
2025-05-07T20:32:30.4833021Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:30.4834035Z context = 
2025-05-07T20:32:30.4834317Z 
2025-05-07T20:32:30.4834474Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:30.4834974Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:30.4835426Z                            module_map=module_map)
2025-05-07T20:32:30.4835772Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:30.4836110Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:30.4836349Z E       ^
2025-05-07T20:32:30.4836882Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:30.4837321Z 
2025-05-07T20:32:30.4837731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:30.6396940Z 
2025-05-07T20:32:30.6397361Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:30.6398024Z     self=,
2025-05-07T20:32:30.6398674Z     T=16384,
2025-05-07T20:32:30.6398964Z     D=5120,
2025-05-07T20:32:30.6399238Z     scale_ub=None,
2025-05-07T20:32:30.6399536Z     contiguous=False,
2025-05-07T20:32:30.6399828Z     compiled=True,
2025-05-07T20:32:30.6400022Z )
2025-05-07T20:32:30.6400330Z self = 
2025-05-07T20:32:30.6400822Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:30.6401096Z 
2025-05-07T20:32:30.6401175Z     @given(
2025-05-07T20:32:30.6401401Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:30.6401709Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:30.6402002Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:30.6402318Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:30.6402639Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:30.6402916Z     )
2025-05-07T20:32:30.6403254Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:30.6403678Z     def test_silu_mul_quant(
2025-05-07T20:32:30.6403910Z         self,
2025-05-07T20:32:30.6404098Z         T: int,
2025-05-07T20:32:30.6404406Z         D: int,
2025-05-07T20:32:30.6404616Z         scale_ub: Optional[float],
2025-05-07T20:32:30.6404879Z         contiguous: bool,
2025-05-07T20:32:30.6405108Z         compiled: bool,
2025-05-07T20:32:30.6405318Z     ) -> None:
2025-05-07T20:32:30.6405526Z         torch.manual_seed(2025)
2025-05-07T20:32:30.6405766Z 
2025-05-07T20:32:30.6406024Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:30.6406357Z 
2025-05-07T20:32:30.6406539Z         x_sign = torch.sign(x)
2025-05-07T20:32:30.6406817Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:30.6407331Z         x = x_sign * x_clamp
2025-05-07T20:32:30.6407556Z         x0 = x[:, :D]
2025-05-07T20:32:30.6407769Z         x1 = x[:, D:]
2025-05-07T20:32:30.6407965Z 
2025-05-07T20:32:30.6408135Z         if contiguous:
2025-05-07T20:32:30.6408573Z             x0 = x0.contiguous()
2025-05-07T20:32:30.6408831Z             x1 = x1.contiguous()
2025-05-07T20:32:30.6409064Z 
2025-05-07T20:32:30.6409244Z         if scale_ub is not None:
2025-05-07T20:32:30.6415775Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:30.6416132Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:30.6416450Z             )
2025-05-07T20:32:30.6416671Z         else:
2025-05-07T20:32:30.6416890Z             scale_ub_tensor = None
2025-05-07T20:32:30.6417144Z 
2025-05-07T20:32:30.6417388Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:30.6417711Z             op = silu_mul_quant
2025-05-07T20:32:30.6417976Z             if compiled:
2025-05-07T20:32:30.6418260Z                 op = torch.compile(op)
2025-05-07T20:32:30.6418590Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:30.6418877Z 
2025-05-07T20:32:30.6419071Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:30.6419242Z 
2025-05-07T20:32:30.6419345Z moe/activation_test.py:117: 
2025-05-07T20:32:30.6419652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.6419986Z moe/activation_test.py:115: in fn
2025-05-07T20:32:30.6420273Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:30.6420842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:30.6421563Z     return fn(*args, **kwargs)
2025-05-07T20:32:30.6422221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:30.6422907Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:30.6423450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:30.6424121Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:30.6424780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:30.6425315Z     kernel = self.compile(
2025-05-07T20:32:30.6425860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:30.6426515Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:30.6426923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:30.6427152Z 
2025-05-07T20:32:30.6427362Z self = 
2025-05-07T20:32:30.6428440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:30.6429807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313a2a0>}
2025-05-07T20:32:30.6431147Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:30.6432171Z context = 
2025-05-07T20:32:30.6432468Z 
2025-05-07T20:32:30.6432640Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:30.6433166Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:30.6433638Z                            module_map=module_map)
2025-05-07T20:32:30.6434124Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:30.6434477Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:30.6434735Z E       ^
2025-05-07T20:32:30.6435204Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:30.6435655Z 
2025-05-07T20:32:30.6436082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
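The ValueError comes out of Triton's NVIDIA backend: fp8e4nv is Triton's name for float8_e4m3fn, and as far as we can tell it is only emitted for GPUs of compute capability 8.9 or newer, while the fp8e4b15/fp8e5 pair named in the message is what older parts get. A minimal guard sketch under that assumption; the 8.9 cutoff is inferred from the error, not stated anywhere in this log, and the decorator usage is hypothetical rather than FBGEMM's actual test setup:

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Best-effort check; assumes fp8e4nv needs compute capability >= 8.9.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage on the test method shown in this log:
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...

Skipping rather than failing would keep the suite green on pre-8.9 runners without hiding the error on hardware that is expected to support FP8.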
2025-05-07T20:32:30.6436593Z 
2025-05-07T20:32:30.6436705Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:30.8078768Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:30.8116686Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:30.9856405Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:30.9888877Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:31.1094915Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:31.1127215Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:31.1157592Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:31.2814060Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:31.2853274Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[each of the ten examples above re-ran the full test body and failed with the identical traceback, ending in triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100]
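Note that the digest includes compiled=False examples failing exactly like the compiled=True ones, so torch.compile is not a factor; the error fires as soon as Triton JIT-compiles _fbgemm_silu_mul_quant. A Hypothesis-free repro sketch under that reading; the import path is copied from the traceback, and the shapes are illustrative:

import torch

# Import path taken from the traceback above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D], x[:, D:]

# On a GPU without fp8e4nv support this raises
# triton.compiler.errors.CompilationError at the first call, since that is
# when the kernel is JIT-compiled; scale_ub is optional and passed as None,
# matching the scale_ub=None examples in the log.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)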
2025-05-07T20:32:31.4195044Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:31.4203454Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:31.4205604Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:31.4207662Z moe/activation_test.py:95: OutOfMemoryError
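From this point on, most examples die while building the bf16 test input (torch.randn, torch.sign, torch.clamp) rather than in the kernel: the process is already holding roughly 21.6-21.7 GiB of the 22.07 GiB card, so even 40-448 MiB requests fail. Two mitigations worth trying, sketched below: the PYTORCH_CUDA_ALLOC_CONF setting the error message itself suggests, and releasing cached blocks at the top of the test body (Hypothesis re-enters the body once per example, whereas a unittest setUp runs only once around the whole @given loop). free_cached_cuda_memory is an illustrative helper, not an existing FBGEMM utility:

    import os

    # Allocator hint quoted from the OOM message itself; it must be set before
    # the process makes its first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def free_cached_cuda_memory() -> None:
        # Hand blocks cached by earlier Hypothesis examples back to the driver.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()

Calling free_cached_cuda_memory() as the first statement of test_silu_mul_quant would reclaim cached-but-unallocated reservations before each example, though it cannot help if live tensors or compiled-graph caches genuinely pin ~21.7 GiB.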
2025-05-07T20:32:31.4207962Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free (21.61 GiB allocated by PyTorch)
2025-05-07T20:32:31.4220839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free (21.50 GiB allocated by PyTorch)
2025-05-07T20:32:31.5479503Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:32:31.5496201Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:32:31.5509173Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.5545576Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.6698271Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.6729658Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB with 30.44 MiB free (21.70 GiB allocated by PyTorch)
2025-05-07T20:32:31.6742935Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -- triton.compiler.errors.CompilationError at moe/activation_test.py:117 (same fp8e4nv ValueError)
2025-05-07T20:32:31.7593346Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
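The request sizes line up exactly with the bf16 input tensor: torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D elements at 2 bytes each, and x_sign and x_clamp each allocate another tensor of the same size, which is why the failing line drifts from activation_test.py:95 back to :92 as free memory shrinks. A quick check against the figures reported in this log:

    def bf16_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements, 2 bytes each
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert bf16_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB"
    assert bf16_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert bf16_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"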
2025-05-07T20:32:31.7612211Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7624058Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7635937Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7655415Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.7667319Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8214858Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8226724Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8238875Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8251031Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -- torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 30.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:32:31.8265163Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:31.8273863Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:31.8276449Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:31.8278846Z 2025-05-07T20:32:31.8278990Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.0081909Z 2025-05-07T20:32:32.0082540Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.0083959Z self=, 2025-05-07T20:32:32.0085567Z T=128, 2025-05-07T20:32:32.0085918Z D=5120, 2025-05-07T20:32:32.0086249Z scale_ub=1200.0, 2025-05-07T20:32:32.0086468Z contiguous=False, 2025-05-07T20:32:32.0086690Z compiled=False, 2025-05-07T20:32:32.0086892Z ) 2025-05-07T20:32:32.0087206Z self = 2025-05-07T20:32:32.0087707Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.0087981Z 2025-05-07T20:32:32.0088058Z @given( 2025-05-07T20:32:32.0088276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.0088583Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.0088883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.0089215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.0089539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.0089819Z ) 2025-05-07T20:32:32.0090171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.0090612Z def test_silu_mul_quant( 2025-05-07T20:32:32.0091033Z self, 2025-05-07T20:32:32.0091221Z T: int, 2025-05-07T20:32:32.0091403Z D: int, 2025-05-07T20:32:32.0091615Z scale_ub: Optional[float], 2025-05-07T20:32:32.0091879Z contiguous: bool, 2025-05-07T20:32:32.0092107Z compiled: bool, 2025-05-07T20:32:32.0092325Z ) -> None: 2025-05-07T20:32:32.0092535Z torch.manual_seed(2025) 2025-05-07T20:32:32.0092766Z 2025-05-07T20:32:32.0093033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.0093372Z 2025-05-07T20:32:32.0093561Z x_sign = torch.sign(x) 2025-05-07T20:32:32.0093843Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.0094155Z x = x_sign * x_clamp 2025-05-07T20:32:32.0094387Z x0 = x[:, :D] 2025-05-07T20:32:32.0094595Z x1 = x[:, D:] 2025-05-07T20:32:32.0094791Z 2025-05-07T20:32:32.0094973Z if contiguous: 2025-05-07T20:32:32.0095190Z x0 = x0.contiguous() 2025-05-07T20:32:32.0095453Z x1 = x1.contiguous() 2025-05-07T20:32:32.0095693Z 2025-05-07T20:32:32.0095873Z if scale_ub is not None: 2025-05-07T20:32:32.0096148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.0096481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.0096779Z ) 2025-05-07T20:32:32.0096967Z else: 2025-05-07T20:32:32.0097173Z scale_ub_tensor = None 2025-05-07T20:32:32.0097411Z 2025-05-07T20:32:32.0097628Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.0097941Z op = silu_mul_quant 2025-05-07T20:32:32.0098183Z if compiled: 2025-05-07T20:32:32.0098539Z op = torch.compile(op) 2025-05-07T20:32:32.0098839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0099111Z 2025-05-07T20:32:32.0099289Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.0099454Z 2025-05-07T20:32:32.0099554Z moe/activation_test.py:117: 2025-05-07T20:32:32.0099845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.0100176Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.0100456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.0101161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.0101873Z 
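The requested allocation sizes line up exactly with the first tensor the test creates: `x` has shape `[T, 2 * D]` in bfloat16, i.e. two bytes per element. A minimal sketch to check that arithmetic, together with the allocator setting the error message itself recommends (`randn_mib` is a hypothetical helper; the environment variable only takes effect if set before the first CUDA allocation in the process):

```python
import os

# Allocator hint quoted verbatim from the OOM message above; must be set
# before CUDA is initialized in this process to have any effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

def randn_mib(T: int, D: int) -> float:
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) holds T * 2D
    # elements at 2 bytes each.
    return T * (2 * D) * 2 / 2**20

assert randn_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
assert randn_mib(4096, 7168) == 112.0   # matches "Tried to allocate 112.00 MiB"
assert randn_mib(2048, 7168) == 56.0    # matches "Tried to allocate 56.00 MiB"
```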
2025-05-07T20:32:32.0082540Z Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.0108791Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
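This is the second failure mode in the run: the Triton frontend rejects fp8e4nv (PyTorch's float8_e4m3fn) before any kernel is launched. fp8e4nv lowering is only available on newer NVIDIA parts; on GPUs below compute capability (8, 9) Triton offers only fp8e4b15 and fp8e5, which is exactly what the error text lists. A hedged skip guard one could put on the test class (a sketch, assuming the runner's GPU is pre-Ada, e.g. an A10G at capability (8, 6)):

```python
import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8_e4m3fn) codegen needs compute capability
    # >= (8, 9) (Ada/Hopper); earlier parts expose only fp8e4b15 / fp8e5.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard; the real test class lives in moe/activation_test.py.
@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationTests(unittest.TestCase):
    ...
```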
2025-05-07T20:32:32.0113911Z Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

With compiled=True the same kernel launch is reached through torch.compile, and the identical compilation error surfaces through the dynamo frame:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same triton jit/compile frames as above ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.7170813Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
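Note the trajectory: by this point even a 20.00 MiB request fails while 21.77 GiB is still allocated by PyTorch. A plausible (unconfirmed) explanation is that the tracebacks kept for each failed example pin that example's tensors alive, so the failures compound across the run. One hedged mitigation, a sketch rather than the repo's fix, is to reset allocator state at the top of the test body:

```python
import gc
import torch

def release_cuda_memory() -> None:
    # Drop dead Python references (including tensors held only by saved
    # tracebacks), return cached blocks to the driver, and drain pending
    # kernels so the next Hypothesis example starts from a clean slate.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
```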
2025-05-07T20:32:32.7192413Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. ... 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

2025-05-07T20:32:32.7204855Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. ...
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

2025-05-07T20:32:32.7263451Z FAILED

2025-05-07T20:32:32.7264036Z =================================== FAILURES ===================================
2025-05-07T20:32:32.7264673Z _____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. ...
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | (same OutOfMemoryError at activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | (same OutOfMemoryError at activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
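Each sub-failure comes with a replay blob. A sketch of applying it exactly as the log instructs, replaying sub-failure 1 deterministically (the @given stack is copied from the listing above; _MAX_SAMPLES is defined elsewhere in the test module and is dropped here):

```python
import unittest
from typing import Optional
from hypothesis import Verbosity, given, reproduce_failure, settings, strategies as st

class ActivationTests(unittest.TestCase):
    # Temporary decorator, as the log suggests; remove it after debugging.
    # The version string must match the installed hypothesis (6.131.14).
    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # original test body from the listing above
```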
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |         a,
    |         ...<23 lines>...
    |         USE_INT64=use_int64,
    |     )
    |   File ".../triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File ".../triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File ".../triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File ".../triton/testing.py", line 117, in do_bench
    |     fn()
    |   File ".../triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(*args, **current)
    |   File ".../triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(src, target=target, options=options.__dict__)
    |   File ".../triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File ".../triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

This example allocates successfully and fn() completes; the failure now comes from the reference path:

        ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(... same autotuner and compile frames as in sub-failure 4 ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:32.7459064Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same compile frames ...)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
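Both kernels fail for the same reason: the fused op (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) each emit fp8e4nv output, so even the eager reference path cannot run on this GPU. For local debugging on such hardware, the row-wise reference can be emulated in plain eager PyTorch, which converts to float8_e4m3fn without Triton. A sketch only; the scale_ub handling here is an assumption, not FBGEMM's exact definition:

```python
import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Row-wise FP8 (e4m3) quantization shaped like the triton_quantize_fp8_row
    # call in ref_fn: returns (y_fp8, y_scale) with
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], matching the test.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
    if scale_ub is not None:
        # Assumed semantics: cap the per-row maximum at scale_ub.
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale
```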
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7481937Z 2025-05-07T20:32:32.7482068Z moe/activation_test.py:126: 2025-05-07T20:32:32.7482468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7482937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7483401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7484605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7485597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7486313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7487225Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7488116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7489047Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7490131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7491009Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7491829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7492513Z fn() 2025-05-07T20:32:32.7493188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7493962Z self.fn.run( 2025-05-07T20:32:32.7494587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7495309Z kernel = self.compile( 2025-05-07T20:32:32.7496048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7496897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7497408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7497724Z 2025-05-07T20:32:32.7498003Z self = 2025-05-07T20:32:32.7499445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7501320Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fab09e3e840>} 2025-05-07T20:32:32.7503036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7504352Z context = 2025-05-07T20:32:32.7506016Z 2025-05-07T20:32:32.7506244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7506968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7507719Z module_map=module_map) 2025-05-07T20:32:32.7508185Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7508939Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7509273Z E ^ 2025-05-07T20:32:32.7509864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7510452Z 
2025-05-07T20:32:32.7510977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.7511629Z 
2025-05-07T20:32:32.7511769Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:32.7515146Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:32.7515511Z 
2025-05-07T20:32:32.7515619Z     @given(
2025-05-07T20:32:32.7515912Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:32.7516329Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:32.7516724Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:32.7517156Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:32.7517603Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:32.7518216Z     )
2025-05-07T20:32:32.7518701Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:32.7519315Z     def test_silu_mul_quant(
2025-05-07T20:32:32.7519643Z         self,
2025-05-07T20:32:32.7519905Z         T: int,
2025-05-07T20:32:32.7520171Z         D: int,
2025-05-07T20:32:32.7520464Z         scale_ub: Optional[float],
2025-05-07T20:32:32.7520829Z         contiguous: bool,
2025-05-07T20:32:32.7521150Z         compiled: bool,
2025-05-07T20:32:32.7521463Z     ) -> None:
2025-05-07T20:32:32.7521746Z         torch.manual_seed(2025)
2025-05-07T20:32:32.7522060Z 
2025-05-07T20:32:32.7522417Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:32.7522869Z 
2025-05-07T20:32:32.7523113Z         x_sign = torch.sign(x)
2025-05-07T20:32:32.7523487Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:32.7523902Z         x = x_sign * x_clamp
2025-05-07T20:32:32.7524326Z         x0 = x[:, :D]
2025-05-07T20:32:32.7524604Z         x1 = x[:, D:]
2025-05-07T20:32:32.7524874Z 
2025-05-07T20:32:32.7525118Z         if contiguous:
2025-05-07T20:32:32.7525412Z             x0 = x0.contiguous()
2025-05-07T20:32:32.7525754Z             x1 = x1.contiguous()
2025-05-07T20:32:32.7526077Z 
2025-05-07T20:32:32.7526319Z         if scale_ub is not None:
2025-05-07T20:32:32.7526685Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:32.7527129Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:32.7527531Z             )
2025-05-07T20:32:32.7527789Z         else:
2025-05-07T20:32:32.7528054Z             scale_ub_tensor = None
2025-05-07T20:32:32.7528372Z 
2025-05-07T20:32:32.7528661Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:32.7529072Z             op = silu_mul_quant
2025-05-07T20:32:32.7529391Z             if compiled:
2025-05-07T20:32:32.7529720Z                 op = torch.compile(op)
2025-05-07T20:32:32.7530114Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.7530465Z 
2025-05-07T20:32:32.7530704Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:32.7530923Z 
2025-05-07T20:32:32.7531043Z moe/activation_test.py:117: 
2025-05-07T20:32:32.7531439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.7532051Z moe/activation_test.py:115: in fn
2025-05-07T20:32:32.7532476Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:32.7533397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:32.7534307Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:32.7535024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:32.7535943Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:32.7536840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:32.7537545Z     kernel = self.compile(
2025-05-07T20:32:32.7538254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:32.7539143Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:32.7539690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:32.7547261Z 
2025-05-07T20:32:32.7547484Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:32.7548202Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:32.7548840Z                            module_map=module_map)
2025-05-07T20:32:32.7549313Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:32.7549788Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:32.7550136Z E   ^
2025-05-07T20:32:32.7550761Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7551369Z 
2025-05-07T20:32:32.7551922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.7552606Z 
2025-05-07T20:32:32.7552749Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7605676Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
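The repeated failure above is hardware-conditional: Triton's fp8e4nv type (torch.float8_e4m3fn) only compiles for GPUs with compute capability sm_89 or newer, and the A10G in a linux.g5.4xlarge runner reports sm_86, where Triton accepts only 'fp8e4b15' and 'fp8e5'. A minimal diagnostic sketch (illustrative only, not part of the test suite) that would confirm this on the runner:

    # check_fp8_support.py -- hypothetical helper, assumes a CUDA build of PyTorch
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")  # the A10G prints sm_86

    # fp8e4nv needs sm_89+ (Ada/Hopper); below that, Triton raises the
    # ValueError seen throughout this log when a kernel touches float8_e4m3fn.
    if (major, minor) < (8, 9):
        print("expect CompilationError from fp8e4nv Triton kernels on this GPU")
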
2025-05-07T20:32:32.7607189Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7636829Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7638304Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7667154Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7668631Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:32.7705650Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7707132Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:32.7729435Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
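Hypothesis keeps drawing fresh examples, but every draw dies at the first line of whichever fp8 kernel it reaches: with compiled=False the eager _fbgemm_silu_mul_quant kernel fails inside fn(), while with compiled=True the torch.compile path gets past fn() and the eager _kernel_quantize_fp8_row reference kernel fails instead. The error therefore does not depend on T, D, scale_ub, contiguous, or compiled. A hypothetical skip guard (not present in moe/activation_test.py; sketched here under that assumption) would avoid burning runner time on GPUs without fp8e4nv support:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv compiles only on sm_89+ GPUs; see the Triton errors above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that exercise float8_e4m3fn Triton kernels.
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs sm_89+)"
    )
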
2025-05-07T20:32:32.7730038Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:32.7746544Z E   CompilationError in _fbgemm_silu_mul_quant (via fn() at moe/activation_test.py:117): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7747153Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7762242Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:32.7762768Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7777965Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
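For reference, the computation both failing kernels implement is small: ref_fn above is silu(x0) * x1 in fp32 followed by rowwise FP8 quantization. A rough pure-PyTorch emulation (an assumption for illustration; FBGEMM's triton_quantize_fp8_row may differ in details such as scale_ub semantics and epsilon handling) runs even without fp8e4nv Triton support:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def rowwise_quant_fp8(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # assumed semantics: scale_ub caps the per-row max used for scaling
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX  # per-row dequantization factor
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(4, 16)
    x1 = torch.randn(4, 16)
    y = x0 * torch.sigmoid(x0) * x1                    # silu(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quant_fp8(y)
    y_dq = y_fp8.to(torch.float32) * y_scale[:, None]  # matches the test's dequant
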
2025-05-07T20:32:32.7778492Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7793686Z E   CompilationError in _kernel_quantize_fp8_row (via ref_fn() at moe/activation_test.py:126): ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7793691Z 2025-05-07T20:32:32.7794101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
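Every failure in this test, before and after this point, has the same root cause: Triton only exposes the fp8e4nv dtype (FP8 E4M3) on GPUs with compute capability >= 8.9 (Ada/Hopper), and on older parts such as the A10G (compute capability 8.6) that backs this runner type it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of the capability check (the helper name is ours, for illustration; it is not part of FBGEMM or this test suite):

import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) compiles only for compute capability >= 8.9,
    # e.g. L4/L40S (8.9) or H100 (9.0); an A10G reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)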
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7800543Z 2025-05-07T20:32:32.7800639Z moe/activation_test.py:126: 2025-05-07T20:32:32.7800763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7800864Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7800998Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7801556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7801656Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7802010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7802313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7802679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7802932Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7803304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7803467Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7803808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7803886Z fn() 2025-05-07T20:32:32.7804373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7804457Z self.fn.run( 2025-05-07T20:32:32.7804799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7804888Z kernel = self.compile( 2025-05-07T20:32:32.7805269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7805440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7805563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7805567Z 2025-05-07T20:32:32.7805777Z self = 2025-05-07T20:32:32.7806628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7807142Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0b0a3e0>} 2025-05-07T20:32:32.7807887Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7808074Z context = 2025-05-07T20:32:32.7808079Z 2025-05-07T20:32:32.7808432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7808811Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7808954Z module_map=module_map) 2025-05-07T20:32:32.7809146Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7809269Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7809372Z E ^ 2025-05-07T20:32:32.7809832Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7809838Z 2025-05-07T20:32:32.7810360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7810370Z 2025-05-07T20:32:32.7810467Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7810683Z self=, 2025-05-07T20:32:32.7810756Z T=16384, 2025-05-07T20:32:32.7810821Z D=5120, 2025-05-07T20:32:32.7810892Z scale_ub=None, 2025-05-07T20:32:32.7810973Z contiguous=True, 2025-05-07T20:32:32.7811046Z compiled=True, 2025-05-07T20:32:32.7811109Z ) 2025-05-07T20:32:32.7811325Z self = 2025-05-07T20:32:32.7811492Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.7811649Z 2025-05-07T20:32:32.7811723Z @given( 2025-05-07T20:32:32.7811835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7811925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7812036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7812146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7812251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7812320Z ) 2025-05-07T20:32:32.7812558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7812642Z def test_silu_mul_quant( 2025-05-07T20:32:32.7812709Z self, 2025-05-07T20:32:32.7812779Z T: int, 2025-05-07T20:32:32.7812844Z D: int, 2025-05-07T20:32:32.7812935Z scale_ub: Optional[float], 2025-05-07T20:32:32.7813016Z contiguous: bool, 2025-05-07T20:32:32.7813097Z compiled: bool, 2025-05-07T20:32:32.7813171Z ) -> None: 2025-05-07T20:32:32.7813256Z torch.manual_seed(2025) 2025-05-07T20:32:32.7813322Z 2025-05-07T20:32:32.7813481Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7813547Z 2025-05-07T20:32:32.7813635Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7813751Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7813835Z x = x_sign * x_clamp 2025-05-07T20:32:32.7813910Z x0 = x[:, :D] 2025-05-07T20:32:32.7813980Z x1 = x[:, D:] 2025-05-07T20:32:32.7814042Z 2025-05-07T20:32:32.7814123Z if contiguous: 2025-05-07T20:32:32.7814205Z x0 = x0.contiguous() 2025-05-07T20:32:32.7814406Z x1 = x1.contiguous() 2025-05-07T20:32:32.7814470Z 2025-05-07T20:32:32.7814552Z if scale_ub is not None: 2025-05-07T20:32:32.7814656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7814789Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7814861Z ) 2025-05-07T20:32:32.7814931Z else: 2025-05-07T20:32:32.7815015Z scale_ub_tensor = None 2025-05-07T20:32:32.7815077Z 2025-05-07T20:32:32.7815205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7815287Z op = silu_mul_quant 2025-05-07T20:32:32.7815362Z if compiled: 2025-05-07T20:32:32.7815459Z op = torch.compile(op) 2025-05-07T20:32:32.7815557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7815626Z 2025-05-07T20:32:32.7815709Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7815823Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7815890Z 2025-05-07T20:32:32.7816022Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7816115Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7816212Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7816329Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7816467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7816534Z 2025-05-07T20:32:32.7816629Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:32.7816633Z 2025-05-07T20:32:32.7816732Z moe/activation_test.py:126: 2025-05-07T20:32:32.7816855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7816953Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7817082Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7817639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7817734Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7818090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7818308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7818757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7819008Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7819374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7819539Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7819875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7819943Z fn() 2025-05-07T20:32:32.7820346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7820421Z self.fn.run( 2025-05-07T20:32:32.7820753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7820847Z kernel = self.compile( 2025-05-07T20:32:32.7821220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7821398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7821524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7821528Z 2025-05-07T20:32:32.7821730Z self = 2025-05-07T20:32:32.7822602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7823101Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae0979580>} 2025-05-07T20:32:32.7823843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7824029Z context = 2025-05-07T20:32:32.7824033Z 2025-05-07T20:32:32.7824194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7824448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7824553Z module_map=module_map) 2025-05-07T20:32:32.7824713Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7824811Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7824883Z E ^ 2025-05-07T20:32:32.7825232Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7825241Z 2025-05-07T20:32:32.7825649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7825656Z 2025-05-07T20:32:32.7825752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7825969Z self=, 2025-05-07T20:32:32.7826040Z T=1, 2025-05-07T20:32:32.7826108Z D=5120, 2025-05-07T20:32:32.7826183Z scale_ub=1200.0, 2025-05-07T20:32:32.7826262Z contiguous=True, 2025-05-07T20:32:32.7826337Z compiled=True, 2025-05-07T20:32:32.7826408Z ) 2025-05-07T20:32:32.7826637Z self = 2025-05-07T20:32:32.7826802Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7826806Z 2025-05-07T20:32:32.7826881Z @given( 2025-05-07T20:32:32.7826994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7827164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7827274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7827388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7827496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7827566Z ) 2025-05-07T20:32:32.7827808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7827893Z def test_silu_mul_quant( 2025-05-07T20:32:32.7827967Z self, 2025-05-07T20:32:32.7828035Z T: int, 2025-05-07T20:32:32.7828104Z D: int, 2025-05-07T20:32:32.7828199Z scale_ub: Optional[float], 2025-05-07T20:32:32.7828284Z contiguous: bool, 2025-05-07T20:32:32.7828373Z compiled: bool, 2025-05-07T20:32:32.7828447Z ) -> None: 2025-05-07T20:32:32.7828535Z torch.manual_seed(2025) 2025-05-07T20:32:32.7828605Z 2025-05-07T20:32:32.7828770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7828840Z 2025-05-07T20:32:32.7828927Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7829045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7829127Z x = x_sign * x_clamp 2025-05-07T20:32:32.7829203Z x0 = x[:, :D] 2025-05-07T20:32:32.7829275Z x1 = x[:, D:] 2025-05-07T20:32:32.7829339Z 2025-05-07T20:32:32.7829419Z if contiguous: 2025-05-07T20:32:32.7829505Z x0 = x0.contiguous() 2025-05-07T20:32:32.7829594Z x1 = x1.contiguous() 2025-05-07T20:32:32.7829660Z 2025-05-07T20:32:32.7829743Z if scale_ub is not None: 2025-05-07T20:32:32.7829927Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7830060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7830127Z ) 2025-05-07T20:32:32.7830199Z else: 2025-05-07T20:32:32.7830284Z scale_ub_tensor = None 2025-05-07T20:32:32.7830354Z 2025-05-07T20:32:32.7830484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7830573Z op = silu_mul_quant 2025-05-07T20:32:32.7830650Z if compiled: 2025-05-07T20:32:32.7830746Z op = torch.compile(op) 2025-05-07T20:32:32.7830846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7830913Z 2025-05-07T20:32:32.7830996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7831001Z 2025-05-07T20:32:32.7831090Z moe/activation_test.py:117: 2025-05-07T20:32:32.7831216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7831314Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7831412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7831778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7831864Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7832354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7832453Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7832804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7833022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7833354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7833439Z kernel = self.compile( 2025-05-07T20:32:32.7833820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7833989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7834113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7834196Z 2025-05-07T20:32:32.7834396Z self = 2025-05-07T20:32:32.7835164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7835667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3805d00>} 2025-05-07T20:32:32.7836406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7836594Z context = 2025-05-07T20:32:32.7836599Z 2025-05-07T20:32:32.7836755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7837016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7837118Z module_map=module_map) 2025-05-07T20:32:32.7837271Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7837361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7837430Z E ^ 2025-05-07T20:32:32.7837778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7837783Z 2025-05-07T20:32:32.7838272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7838277Z 2025-05-07T20:32:32.7838371Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7838588Z self=, 2025-05-07T20:32:32.7838657Z T=1, 2025-05-07T20:32:32.7838726Z D=5120, 2025-05-07T20:32:32.7838803Z scale_ub=None, 2025-05-07T20:32:32.7838879Z contiguous=False, 2025-05-07T20:32:32.7838951Z compiled=True, 2025-05-07T20:32:32.7839019Z ) 2025-05-07T20:32:32.7839237Z self = 2025-05-07T20:32:32.7839394Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.7839399Z 2025-05-07T20:32:32.7839469Z @given( 2025-05-07T20:32:32.7839580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7839672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7839779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7839892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7840000Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7840065Z ) 2025-05-07T20:32:32.7840301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7840390Z def test_silu_mul_quant( 2025-05-07T20:32:32.7840455Z self, 2025-05-07T20:32:32.7840520Z T: int, 2025-05-07T20:32:32.7840588Z D: int, 2025-05-07T20:32:32.7840676Z scale_ub: Optional[float], 2025-05-07T20:32:32.7840756Z contiguous: bool, 2025-05-07T20:32:32.7840833Z compiled: bool, 2025-05-07T20:32:32.7840900Z ) -> None: 2025-05-07T20:32:32.7840990Z torch.manual_seed(2025) 2025-05-07T20:32:32.7841053Z 2025-05-07T20:32:32.7841217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7841282Z 2025-05-07T20:32:32.7841367Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7841490Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7841572Z x = x_sign * x_clamp 2025-05-07T20:32:32.7841640Z x0 = x[:, :D] 2025-05-07T20:32:32.7841710Z x1 = x[:, D:] 2025-05-07T20:32:32.7841774Z 2025-05-07T20:32:32.7841849Z if contiguous: 2025-05-07T20:32:32.7842015Z x0 = x0.contiguous() 2025-05-07T20:32:32.7842101Z x1 = x1.contiguous() 2025-05-07T20:32:32.7842162Z 2025-05-07T20:32:32.7842242Z if scale_ub is not None: 2025-05-07T20:32:32.7842343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7842471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7842542Z ) 2025-05-07T20:32:32.7842608Z else: 2025-05-07T20:32:32.7842695Z scale_ub_tensor = None 2025-05-07T20:32:32.7842760Z 2025-05-07T20:32:32.7842880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7842959Z op = silu_mul_quant 2025-05-07T20:32:32.7843043Z if compiled: 2025-05-07T20:32:32.7843135Z op = torch.compile(op) 2025-05-07T20:32:32.7843231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7843296Z 2025-05-07T20:32:32.7843377Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7843494Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7843559Z 2025-05-07T20:32:32.7843687Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7843780Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7843870Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7843984Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7844123Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7844296Z 2025-05-07T20:32:32.7844389Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7844394Z 2025-05-07T20:32:32.7844484Z moe/activation_test.py:126: 2025-05-07T20:32:32.7844689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7844790Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7844914Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7845466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7845566Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7845919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7846136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7846497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7846747Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7847124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7847283Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7847617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7847693Z fn() 2025-05-07T20:32:32.7848086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7848164Z self.fn.run( 2025-05-07T20:32:32.7848493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7848576Z kernel = self.compile( 2025-05-07T20:32:32.7848952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7849117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7849241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7849245Z 2025-05-07T20:32:32.7849445Z self = 2025-05-07T20:32:32.7850213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7850818Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021c180>} 2025-05-07T20:32:32.7851558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7851748Z context = 2025-05-07T20:32:32.7851753Z 2025-05-07T20:32:32.7851909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7852163Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7852270Z module_map=module_map) 2025-05-07T20:32:32.7852423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7852516Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7852588Z E ^ 2025-05-07T20:32:32.7852934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7852939Z 2025-05-07T20:32:32.7853345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7853349Z 2025-05-07T20:32:32.7853444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7853735Z self=, 2025-05-07T20:32:32.7853809Z T=1, 2025-05-07T20:32:32.7853876Z D=5120, 2025-05-07T20:32:32.7853948Z scale_ub=None, 2025-05-07T20:32:32.7854027Z contiguous=True, 2025-05-07T20:32:32.7854100Z compiled=False, 2025-05-07T20:32:32.7854174Z ) 2025-05-07T20:32:32.7854386Z self = 2025-05-07T20:32:32.7854542Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.7854546Z 2025-05-07T20:32:32.7854616Z @given( 2025-05-07T20:32:32.7854725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7854815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7854930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7855038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7855143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7855212Z ) 2025-05-07T20:32:32.7855455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7855543Z def test_silu_mul_quant( 2025-05-07T20:32:32.7855608Z self, 2025-05-07T20:32:32.7855676Z T: int, 2025-05-07T20:32:32.7855748Z D: int, 2025-05-07T20:32:32.7855835Z scale_ub: Optional[float], 2025-05-07T20:32:32.7855917Z contiguous: bool, 2025-05-07T20:32:32.7855998Z compiled: bool, 2025-05-07T20:32:32.7856069Z ) -> None: 2025-05-07T20:32:32.7856153Z torch.manual_seed(2025) 2025-05-07T20:32:32.7856220Z 2025-05-07T20:32:32.7856383Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7856449Z 2025-05-07T20:32:32.7856532Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7856649Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7856729Z x = x_sign * x_clamp 2025-05-07T20:32:32.7856798Z x0 = x[:, :D] 2025-05-07T20:32:32.7856875Z x1 = x[:, D:] 2025-05-07T20:32:32.7856941Z 2025-05-07T20:32:32.7857018Z if contiguous: 2025-05-07T20:32:32.7857100Z x0 = x0.contiguous() 2025-05-07T20:32:32.7857185Z x1 = x1.contiguous() 2025-05-07T20:32:32.7857246Z 2025-05-07T20:32:32.7857414Z if scale_ub is not None: 2025-05-07T20:32:32.7857517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7857645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7857710Z ) 2025-05-07T20:32:32.7857779Z else: 2025-05-07T20:32:32.7857862Z scale_ub_tensor = None 2025-05-07T20:32:32.7857924Z 2025-05-07T20:32:32.7858046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7858125Z op = silu_mul_quant 2025-05-07T20:32:32.7858203Z if compiled: 2025-05-07T20:32:32.7858294Z op = torch.compile(op) 2025-05-07T20:32:32.7858391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7858462Z 2025-05-07T20:32:32.7858545Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7858549Z 2025-05-07T20:32:32.7858636Z moe/activation_test.py:117: 2025-05-07T20:32:32.7858769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7858867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7858956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7859454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7859549Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7863707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7863952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7864403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7864503Z kernel = self.compile( 2025-05-07T20:32:32.7864890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7865070Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7865215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7865220Z 2025-05-07T20:32:32.7865436Z self = 2025-05-07T20:32:32.7866255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7866770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae03974c0>} 2025-05-07T20:32:32.7867515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7867717Z context = 2025-05-07T20:32:32.7867722Z 2025-05-07T20:32:32.7867887Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7868147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7868262Z module_map=module_map) 2025-05-07T20:32:32.7868429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7868535Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7868613Z E ^ 2025-05-07T20:32:32.7868971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7868976Z 2025-05-07T20:32:32.7869387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7869391Z 2025-05-07T20:32:32.7869498Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7869809Z self=, 2025-05-07T20:32:32.7869888Z T=128, 2025-05-07T20:32:32.7869965Z D=5120, 2025-05-07T20:32:32.7870054Z scale_ub=None, 2025-05-07T20:32:32.7870141Z contiguous=False, 2025-05-07T20:32:32.7870224Z compiled=True, 2025-05-07T20:32:32.7870300Z ) 2025-05-07T20:32:32.7870517Z self = 2025-05-07T20:32:32.7870686Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.7870690Z 2025-05-07T20:32:32.7870770Z @given( 2025-05-07T20:32:32.7870889Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7870998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7871112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7871228Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7871342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7871421Z ) 2025-05-07T20:32:32.7871665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7871761Z def test_silu_mul_quant( 2025-05-07T20:32:32.7871836Z self, 2025-05-07T20:32:32.7871913Z T: int, 2025-05-07T20:32:32.7872010Z D: int, 2025-05-07T20:32:32.7872120Z scale_ub: Optional[float], 2025-05-07T20:32:32.7872227Z contiguous: bool, 2025-05-07T20:32:32.7872313Z compiled: bool, 2025-05-07T20:32:32.7872390Z ) -> None: 2025-05-07T20:32:32.7872484Z torch.manual_seed(2025) 2025-05-07T20:32:32.7872559Z 2025-05-07T20:32:32.7872812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7872889Z 2025-05-07T20:32:32.7872979Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7873104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7873194Z x = x_sign * x_clamp 2025-05-07T20:32:32.7873278Z x0 = x[:, :D] 2025-05-07T20:32:32.7873356Z x1 = x[:, D:] 2025-05-07T20:32:32.7873430Z 2025-05-07T20:32:32.7873512Z if contiguous: 2025-05-07T20:32:32.7873602Z x0 = x0.contiguous() 2025-05-07T20:32:32.7873694Z x1 = x1.contiguous() 2025-05-07T20:32:32.7873766Z 2025-05-07T20:32:32.7873857Z if scale_ub is not None: 2025-05-07T20:32:32.7873961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7874094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7874171Z ) 2025-05-07T20:32:32.7874249Z else: 2025-05-07T20:32:32.7874345Z scale_ub_tensor = None 2025-05-07T20:32:32.7874422Z 2025-05-07T20:32:32.7874558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7874648Z op = silu_mul_quant 2025-05-07T20:32:32.7874738Z if compiled: 2025-05-07T20:32:32.7874835Z op = torch.compile(op) 2025-05-07T20:32:32.7874944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7875017Z 2025-05-07T20:32:32.7875105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7875110Z 2025-05-07T20:32:32.7875207Z moe/activation_test.py:117: 2025-05-07T20:32:32.7875336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7875436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7875536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7875901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7875994Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7876498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7876596Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7876955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7877261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7877598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7877692Z kernel = self.compile( 2025-05-07T20:32:32.7878071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7878245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7878376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7878380Z 2025-05-07T20:32:32.7878586Z self = 2025-05-07T20:32:32.7879363Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7879871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021ec00>} 2025-05-07T20:32:32.7880616Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7880808Z context = 2025-05-07T20:32:32.7880813Z 2025-05-07T20:32:32.7881049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7881322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7881434Z module_map=module_map) 2025-05-07T20:32:32.7881600Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7881703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7881779Z E ^ 2025-05-07T20:32:32.7882139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7882144Z 2025-05-07T20:32:32.7882557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7882562Z 2025-05-07T20:32:32.7882662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7882887Z self=, 2025-05-07T20:32:32.7882963Z T=128, 2025-05-07T20:32:32.7883043Z D=7168, 2025-05-07T20:32:32.7883132Z scale_ub=1200.0, 2025-05-07T20:32:32.7883218Z contiguous=False, 2025-05-07T20:32:32.7883307Z compiled=False, 2025-05-07T20:32:32.7883379Z ) 2025-05-07T20:32:32.7883596Z self = 2025-05-07T20:32:32.7883779Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.7883783Z 2025-05-07T20:32:32.7883861Z @given( 2025-05-07T20:32:32.7883979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7884081Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7884308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7884425Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7884536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7884609Z ) 2025-05-07T20:32:32.7884854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7884951Z def test_silu_mul_quant( 2025-05-07T20:32:32.7885030Z self, 2025-05-07T20:32:32.7885107Z T: int, 2025-05-07T20:32:32.7885185Z D: int, 2025-05-07T20:32:32.7885284Z scale_ub: Optional[float], 2025-05-07T20:32:32.7885377Z contiguous: bool, 2025-05-07T20:32:32.7885571Z compiled: bool, 2025-05-07T20:32:32.7885650Z ) -> None: 2025-05-07T20:32:32.7885745Z torch.manual_seed(2025) 2025-05-07T20:32:32.7885822Z 2025-05-07T20:32:32.7886000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7886074Z 2025-05-07T20:32:32.7886166Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7886289Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7886376Z x = x_sign * x_clamp 2025-05-07T20:32:32.7886462Z x0 = x[:, :D] 2025-05-07T20:32:32.7886540Z x1 = x[:, D:] 2025-05-07T20:32:32.7886612Z 2025-05-07T20:32:32.7886700Z if contiguous: 2025-05-07T20:32:32.7886795Z x0 = x0.contiguous() 2025-05-07T20:32:32.7886882Z x1 = x1.contiguous() 2025-05-07T20:32:32.7886962Z 2025-05-07T20:32:32.7887050Z if scale_ub is not None: 2025-05-07T20:32:32.7887154Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7887303Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7887379Z ) 2025-05-07T20:32:32.7887453Z else: 2025-05-07T20:32:32.7887549Z scale_ub_tensor = None 2025-05-07T20:32:32.7887620Z 2025-05-07T20:32:32.7887752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7887840Z op = silu_mul_quant 2025-05-07T20:32:32.7887924Z if compiled: 2025-05-07T20:32:32.7888026Z op = torch.compile(op) 2025-05-07T20:32:32.7888130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7888205Z 2025-05-07T20:32:32.7888298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7888302Z 2025-05-07T20:32:32.7888480Z moe/activation_test.py:117: 2025-05-07T20:32:32.7888609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7888714Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7888812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7889320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7889418Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7889776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7890002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7890338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7890433Z kernel = self.compile( 2025-05-07T20:32:32.7890821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7890996Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7891125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7891133Z 2025-05-07T20:32:32.7891336Z self = 2025-05-07T20:32:32.7892111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7892619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae021d9e0>} 2025-05-07T20:32:32.7893368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7893562Z context = 2025-05-07T20:32:32.7893567Z 2025-05-07T20:32:32.7893810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7894075Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7894182Z module_map=module_map) 2025-05-07T20:32:32.7894342Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7894444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7894522Z E ^ 2025-05-07T20:32:32.7894872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7894877Z 2025-05-07T20:32:32.7895297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7895302Z 2025-05-07T20:32:32.7895403Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7895628Z self=, 2025-05-07T20:32:32.7895714Z T=128, 2025-05-07T20:32:32.7895792Z D=5120, 2025-05-07T20:32:32.7895877Z scale_ub=None, 2025-05-07T20:32:32.7895961Z contiguous=False, 2025-05-07T20:32:32.7896044Z compiled=False, 2025-05-07T20:32:32.7896117Z ) 2025-05-07T20:32:32.7896334Z self = 2025-05-07T20:32:32.7896504Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.7896511Z 2025-05-07T20:32:32.7896586Z @given( 2025-05-07T20:32:32.7896706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7896808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7896999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7897117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7897231Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7897307Z ) 2025-05-07T20:32:32.7897550Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7897656Z def test_silu_mul_quant( 2025-05-07T20:32:32.7897731Z self, 2025-05-07T20:32:32.7897807Z T: int, 2025-05-07T20:32:32.7897889Z D: int, 2025-05-07T20:32:32.7897986Z scale_ub: Optional[float], 2025-05-07T20:32:32.7898077Z contiguous: bool, 2025-05-07T20:32:32.7898161Z compiled: bool, 2025-05-07T20:32:32.7898238Z ) -> None: 2025-05-07T20:32:32.7898335Z torch.manual_seed(2025) 2025-05-07T20:32:32.7898408Z 2025-05-07T20:32:32.7898576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7898653Z 2025-05-07T20:32:32.7898745Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7898872Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7898965Z x = x_sign * x_clamp 2025-05-07T20:32:32.7899043Z x0 = x[:, :D] 2025-05-07T20:32:32.7899121Z x1 = x[:, D:] 2025-05-07T20:32:32.7899213Z 2025-05-07T20:32:32.7899307Z if contiguous: 2025-05-07T20:32:32.7899421Z x0 = x0.contiguous() 2025-05-07T20:32:32.7899513Z x1 = x1.contiguous() 2025-05-07T20:32:32.7899586Z 2025-05-07T20:32:32.7899681Z if scale_ub is not None: 2025-05-07T20:32:32.7899786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7899920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7899997Z ) 2025-05-07T20:32:32.7900073Z else: 2025-05-07T20:32:32.7900166Z scale_ub_tensor = None 2025-05-07T20:32:32.7900240Z 2025-05-07T20:32:32.7900366Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7900462Z op = silu_mul_quant 2025-05-07T20:32:32.7900548Z if compiled: 2025-05-07T20:32:32.7900646Z op = torch.compile(op) 2025-05-07T20:32:32.7900753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7900824Z 2025-05-07T20:32:32.7900913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7901007Z 2025-05-07T20:32:32.7901106Z moe/activation_test.py:117: 2025-05-07T20:32:32.7901233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7901332Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7901432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7901926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7902021Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7902384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7902617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7902964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7903060Z kernel = self.compile( 2025-05-07T20:32:32.7903448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7903627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7903757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7903761Z 2025-05-07T20:32:32.7903974Z self = 2025-05-07T20:32:32.7904832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7905341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faae2f7dc60>} 2025-05-07T20:32:32.7906089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7906284Z context = 2025-05-07T20:32:32.7906289Z 2025-05-07T20:32:32.7906459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7906723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7906829Z module_map=module_map) 2025-05-07T20:32:32.7906996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7907101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7907180Z E ^ 2025-05-07T20:32:32.7907532Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7907536Z 2025-05-07T20:32:32.7907960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7907964Z 2025-05-07T20:32:32.7908067Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7908510Z self=, 2025-05-07T20:32:32.7908623Z T=128, 2025-05-07T20:32:32.7908726Z D=5120, 2025-05-07T20:32:32.7908808Z scale_ub=1200.0, 2025-05-07T20:32:32.7908889Z contiguous=True, 2025-05-07T20:32:32.7908968Z compiled=False, 2025-05-07T20:32:32.7909042Z ) 2025-05-07T20:32:32.7909299Z self = 2025-05-07T20:32:32.7909477Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.7909482Z 2025-05-07T20:32:32.7909554Z @given( 2025-05-07T20:32:32.7909671Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7909765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7910032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7910146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7910255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7910333Z ) 2025-05-07T20:32:32.7910573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7910662Z def test_silu_mul_quant( 2025-05-07T20:32:32.7910743Z self, 2025-05-07T20:32:32.7910816Z T: int, 2025-05-07T20:32:32.7910885Z D: int, 2025-05-07T20:32:32.7910981Z scale_ub: Optional[float], 2025-05-07T20:32:32.7911064Z contiguous: bool, 2025-05-07T20:32:32.7911152Z compiled: bool, 2025-05-07T20:32:32.7911229Z ) -> None: 2025-05-07T20:32:32.7911319Z torch.manual_seed(2025) 2025-05-07T20:32:32.7911388Z 2025-05-07T20:32:32.7911556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7911623Z 2025-05-07T20:32:32.7911723Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7911842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7911931Z x = x_sign * x_clamp 2025-05-07T20:32:32.7912010Z x0 = x[:, :D] 2025-05-07T20:32:32.7912084Z x1 = x[:, D:] 2025-05-07T20:32:32.7912159Z 2025-05-07T20:32:32.7912240Z if contiguous: 2025-05-07T20:32:32.7912328Z x0 = x0.contiguous() 2025-05-07T20:32:32.7912411Z x1 = x1.contiguous() 2025-05-07T20:32:32.7912485Z 2025-05-07T20:32:32.7912571Z if scale_ub is not None: 2025-05-07T20:32:32.7912670Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7912944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7913021Z ) 2025-05-07T20:32:32.7913093Z else: 2025-05-07T20:32:32.7913185Z scale_ub_tensor = None 2025-05-07T20:32:32.7913254Z 2025-05-07T20:32:32.7913384Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7913475Z op = silu_mul_quant 2025-05-07T20:32:32.7913552Z if compiled: 2025-05-07T20:32:32.7913650Z op = torch.compile(op) 2025-05-07T20:32:32.7913755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7913821Z 2025-05-07T20:32:32.7913909Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7913913Z 2025-05-07T20:32:32.7914006Z moe/activation_test.py:117: 2025-05-07T20:32:32.7914133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7914231Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7914324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7914826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7914918Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7915275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7915502Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7915839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7915930Z kernel = self.compile( 2025-05-07T20:32:32.7916307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7916478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7916603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7916608Z 2025-05-07T20:32:32.7916813Z self = 2025-05-07T20:32:32.7917589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7918175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a79ee0>} 2025-05-07T20:32:32.7918915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7919108Z context = 2025-05-07T20:32:32.7919113Z 2025-05-07T20:32:32.7919282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7919544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7919646Z module_map=module_map) 2025-05-07T20:32:32.7919803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7919904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7919978Z E ^ 2025-05-07T20:32:32.7920326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7920333Z 2025-05-07T20:32:32.7920741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7920746Z 2025-05-07T20:32:32.7920845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.7921065Z self=, 2025-05-07T20:32:32.7921139Z T=1, 2025-05-07T20:32:32.7921285Z D=7168, 2025-05-07T20:32:32.7921368Z scale_ub=1200.0, 2025-05-07T20:32:32.7921448Z contiguous=True, 2025-05-07T20:32:32.7921524Z compiled=True, 2025-05-07T20:32:32.7921596Z ) 2025-05-07T20:32:32.7921814Z self = 2025-05-07T20:32:32.7921986Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.7921990Z 2025-05-07T20:32:32.7922061Z @given( 2025-05-07T20:32:32.7922176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.7922274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.7922383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.7922494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.7922604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.7922671Z ) 2025-05-07T20:32:32.7922915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.7923014Z def test_silu_mul_quant( 2025-05-07T20:32:32.7923086Z self, 2025-05-07T20:32:32.7923159Z T: int, 2025-05-07T20:32:32.7923229Z D: int, 2025-05-07T20:32:32.7923321Z scale_ub: Optional[float], 2025-05-07T20:32:32.7923408Z contiguous: bool, 2025-05-07T20:32:32.7923491Z compiled: bool, 2025-05-07T20:32:32.7923563Z ) -> None: 2025-05-07T20:32:32.7923655Z torch.manual_seed(2025) 2025-05-07T20:32:32.7923723Z 2025-05-07T20:32:32.7923888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.7923958Z 2025-05-07T20:32:32.7924047Z x_sign = torch.sign(x) 2025-05-07T20:32:32.7924168Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.7924341Z x = x_sign * x_clamp 2025-05-07T20:32:32.7924416Z x0 = x[:, :D] 2025-05-07T20:32:32.7924492Z x1 = x[:, D:] 2025-05-07T20:32:32.7924561Z 2025-05-07T20:32:32.7924640Z if contiguous: 2025-05-07T20:32:32.7924745Z x0 = x0.contiguous() 2025-05-07T20:32:32.7924830Z x1 = x1.contiguous() 2025-05-07T20:32:32.7924897Z 2025-05-07T20:32:32.7924990Z if scale_ub is not None: 2025-05-07T20:32:32.7925090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.7925309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.7925404Z ) 2025-05-07T20:32:32.7925481Z else: 2025-05-07T20:32:32.7925588Z scale_ub_tensor = None 2025-05-07T20:32:32.7925668Z 2025-05-07T20:32:32.7925793Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7925876Z op = silu_mul_quant 2025-05-07T20:32:32.7925963Z if compiled: 2025-05-07T20:32:32.7926058Z op = torch.compile(op) 2025-05-07T20:32:32.7926163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7926233Z 2025-05-07T20:32:32.7926318Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.7926323Z 2025-05-07T20:32:32.7926428Z moe/activation_test.py:117: 2025-05-07T20:32:32.7926555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7926652Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.7926751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.7927119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.7927216Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.7927707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7927799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7928151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7928369Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7928784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7928880Z kernel = self.compile( 2025-05-07T20:32:32.7929256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7929445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7929572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7929577Z 2025-05-07T20:32:32.7929782Z self = 2025-05-07T20:32:32.7930557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7931066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7a660>} 2025-05-07T20:32:32.7931810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7932004Z context = 2025-05-07T20:32:32.7932008Z 2025-05-07T20:32:32.7932173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7932433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7932534Z module_map=module_map) 2025-05-07T20:32:32.7932696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7932789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7932860Z E ^ 2025-05-07T20:32:32.7933217Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7933630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
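Every example in this run dies at the same point: Triton refuses to lower the kernel because fp8e4nv, the NVIDIA e4m3 format behind torch.float8_e4m3fn, is only supported natively from compute capability 8.9 (Ada/Hopper) on, while the A10G in a g5.4xlarge runner is SM 8.6 and offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these cases instead of failing them; the helper name supports_fp8e4nv is hypothetical, not FBGEMM or Triton API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton maps torch.float8_e4m3fn to its fp8e4nv
    # type, which NVIDIA GPUs support natively only from SM 8.9 upward;
    # the A10G behind this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test class, this turns the CompilationError into a skip.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (e.g. L4, H100)")
class Fp8ActivationTests(unittest.TestCase):
    ...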
2025-05-07T20:32:32.7940430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.7940528Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.7940880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7941186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7941524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7941616Z kernel = self.compile( 2025-05-07T20:32:32.7941996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7942178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7942313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7942317Z 2025-05-07T20:32:32.7942524Z self = 2025-05-07T20:32:32.7943296Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7943809Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a793a0>} 2025-05-07T20:32:32.7944551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7944746Z context = 2025-05-07T20:32:32.7944750Z 2025-05-07T20:32:32.7945012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7945273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7945381Z module_map=module_map) 2025-05-07T20:32:32.7945547Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7945644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7945726Z E ^ 2025-05-07T20:32:32.7946083Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7946503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7946608Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) 2025-05-07T20:32:32.7952241Z y_fp8, y_scale = fn() 2025-05-07T20:32:32.7952363Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:32.7952571Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.7952672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:32.7952779Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:32.7952898Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:32.7953041Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7953215Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:32.7953220Z 2025-05-07T20:32:32.7953319Z moe/activation_test.py:126: 2025-05-07T20:32:32.7953448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7953553Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:32.7953688Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:32.7954248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:32.7954353Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:32.7954714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.7954939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.7955305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:32.7955558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:32.7955931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:32.7956099Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:32.7956442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:32.7956520Z fn() 2025-05-07T20:32:32.7956918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:32.7957078Z self.fn.run( 2025-05-07T20:32:32.7957413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.7957504Z kernel = self.compile( 2025-05-07T20:32:32.7957880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.7958054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.7958180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.7958185Z 2025-05-07T20:32:32.7958394Z self = 2025-05-07T20:32:32.7959173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.7959692Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad3a7bce0>} 2025-05-07T20:32:32.7960438Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.7960629Z context = 2025-05-07T20:32:32.7960634Z 2025-05-07T20:32:32.7960805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.7961145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.7961256Z module_map=module_map) 2025-05-07T20:32:32.7961417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7961516Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:32.7961599Z E ^ 2025-05-07T20:32:32.7961951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7962366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
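The scale_ub=None example above is the one informative variation in this run: the failure surfaces in ref_fn rather than in fn(), because triton_quantize_fp8_row's _kernel_quantize_fp8_row targets the same fp8e4nv type, so neither the kernel under test nor the Triton reference quantizer compiles on this GPU. A plain eager-PyTorch stand-in for the rowwise contract the test implies (dequantization is y_fp8.to(torch.float32) * y_scale[:, None], so the returned scale is max(|row|)/448 with an optional scale_ub clamp); this sketches the assumed semantics, not fbgemm_gpu's actual implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max magnitude in fp32, clamped from above by scale_ub if given.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    # Guard all-zero rows against a zero divisor.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

The eager casts to torch.float8_e4m3fn are software conversions that work on any CUDA device, which is what makes a fallback like this usable on pre-SM89 hardware.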
2025-05-07T20:32:32.7962472Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.7974505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7974604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7974682Z E ^ 2025-05-07T20:32:32.7975035Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7975458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7975563Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) 2025-05-07T20:32:32.7990830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.7990936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.7991011Z E ^ 2025-05-07T20:32:32.7991365Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.7991788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.7991896Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.8003756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8003851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8003927Z E ^ 2025-05-07T20:32:32.8004367Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8004860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8004972Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) 2025-05-07T20:32:32.8016929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8017023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8017093Z E ^ 2025-05-07T20:32:32.8017519Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8017931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8018031Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) 2025-05-07T20:32:32.8028927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8029016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8029089Z E ^ 2025-05-07T20:32:32.8029443Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8029849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8029948Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) 2025-05-07T20:32:32.8040874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8040964Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8041032Z E ^ 2025-05-07T20:32:32.8041381Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8041791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8041893Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) 2025-05-07T20:32:32.8053266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8053360Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8053428Z E ^ 2025-05-07T20:32:32.8053777Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8053781Z 2025-05-07T20:32:32.8054184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8054188Z 2025-05-07T20:32:32.8054280Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8054495Z self=, 2025-05-07T20:32:32.8054567Z T=4096, 2025-05-07T20:32:32.8054636Z D=5120, 2025-05-07T20:32:32.8054706Z scale_ub=None, 2025-05-07T20:32:32.8054781Z contiguous=False, 2025-05-07T20:32:32.8054858Z compiled=True, 2025-05-07T20:32:32.8054920Z ) 2025-05-07T20:32:32.8055131Z self = 2025-05-07T20:32:32.8055378Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8055383Z 2025-05-07T20:32:32.8055450Z @given( 2025-05-07T20:32:32.8055558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8055650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8055757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8055872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8055976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8056040Z ) 2025-05-07T20:32:32.8056283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8056366Z def test_silu_mul_quant( 2025-05-07T20:32:32.8056431Z self, 2025-05-07T20:32:32.8056501Z T: int, 2025-05-07T20:32:32.8056566Z D: int, 2025-05-07T20:32:32.8056652Z scale_ub: Optional[float], 2025-05-07T20:32:32.8056744Z contiguous: bool, 2025-05-07T20:32:32.8056818Z compiled: bool, 2025-05-07T20:32:32.8056886Z ) -> None: 2025-05-07T20:32:32.8056971Z torch.manual_seed(2025) 2025-05-07T20:32:32.8057034Z 2025-05-07T20:32:32.8057195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8057261Z 2025-05-07T20:32:32.8057343Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8057461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8057540Z x = x_sign * x_clamp 2025-05-07T20:32:32.8057609Z x0 = x[:, :D] 2025-05-07T20:32:32.8057689Z x1 = x[:, D:] 2025-05-07T20:32:32.8057759Z 2025-05-07T20:32:32.8057922Z if contiguous: 2025-05-07T20:32:32.8058014Z x0 = x0.contiguous() 2025-05-07T20:32:32.8058099Z x1 = x1.contiguous() 2025-05-07T20:32:32.8058167Z 2025-05-07T20:32:32.8058257Z if scale_ub is not None: 2025-05-07T20:32:32.8058363Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8058495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8058570Z ) 2025-05-07T20:32:32.8058640Z else: 2025-05-07T20:32:32.8058734Z scale_ub_tensor = None 2025-05-07T20:32:32.8058803Z 2025-05-07T20:32:32.8058926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8059014Z op = silu_mul_quant 2025-05-07T20:32:32.8059092Z if compiled: 2025-05-07T20:32:32.8059187Z op = torch.compile(op) 2025-05-07T20:32:32.8059290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8059360Z 2025-05-07T20:32:32.8059456Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8059461Z 2025-05-07T20:32:32.8059561Z moe/activation_test.py:117: 2025-05-07T20:32:32.8059684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8059784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8059883Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8060243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8060333Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8060820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8060913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8061264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8061488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8061827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8061916Z kernel = self.compile( 2025-05-07T20:32:32.8062294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8062551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8062672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8062676Z 2025-05-07T20:32:32.8062881Z self = 2025-05-07T20:32:32.8063655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8064164Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2e46980>} 2025-05-07T20:32:32.8064903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8065095Z context = 2025-05-07T20:32:32.8065100Z 2025-05-07T20:32:32.8065262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8065520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8065624Z module_map=module_map) 2025-05-07T20:32:32.8065780Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8065876Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8065956Z E ^ 2025-05-07T20:32:32.8066403Z E ValueError("type fp8e4nv not supported in this architecture. 
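Every drawn example fails at the same point: Triton rejects the fp8e4nv type (float8 E4M3, NVIDIA variant) while lowering `_fbgemm_silu_mul_quant` to TTIR. As of recent Triton releases, fp8e4nv conversions are only emitted for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); this job runs on a linux.g5.4xlarge runner whose A10G GPU is SM 8.6, where only fp8e4b15 and fp8e5 are available, hence the ValueError. The compiled=True examples differ only by the extra torch/_dynamo/eval_frame.py frame; the Triton compile path and the error are identical. A minimal sketch that reproduces the same CompilationError on a pre-SM8.9 GPU (the kernel below is illustrative, not FBGEMM's actual `_fbgemm_silu_mul_quant`):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast fails at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(n // 256,)](x, y, n, BLOCK=256)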
The next ten Hypothesis examples fail with the same test source and the same traceback as above (modulo the extra torch/_dynamo frame when compiled=True); only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)

Each raises:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
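Because the failure depends only on the GPU architecture and not on the drawn parameters, a test like this is usually gated on device capability rather than retried across examples. A minimal sketch of such a guard (the helper and class names are illustrative, not the FBGEMM test suite's actual structure):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (E4M3) conversions only for SM >= 8.9
        # (Ada/Hopper); the A10G on g5 runners is SM 8.6.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log above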
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8195116Z 2025-05-07T20:32:32.8195522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8195526Z 2025-05-07T20:32:32.8195622Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8195836Z self=, 2025-05-07T20:32:32.8195905Z T=16384, 2025-05-07T20:32:32.8195973Z D=5120, 2025-05-07T20:32:32.8196049Z scale_ub=None, 2025-05-07T20:32:32.8196128Z contiguous=False, 2025-05-07T20:32:32.8196199Z compiled=True, 2025-05-07T20:32:32.8196261Z ) 2025-05-07T20:32:32.8196474Z self = 2025-05-07T20:32:32.8196642Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8196651Z 2025-05-07T20:32:32.8196718Z @given( 2025-05-07T20:32:32.8196827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8196916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8197031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8197137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8197240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8197306Z ) 2025-05-07T20:32:32.8197545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8197631Z def test_silu_mul_quant( 2025-05-07T20:32:32.8197711Z self, 2025-05-07T20:32:32.8197776Z T: int, 2025-05-07T20:32:32.8197845Z D: int, 2025-05-07T20:32:32.8197934Z scale_ub: Optional[float], 2025-05-07T20:32:32.8198014Z contiguous: bool, 2025-05-07T20:32:32.8198092Z compiled: bool, 2025-05-07T20:32:32.8198245Z ) -> None: 2025-05-07T20:32:32.8198329Z torch.manual_seed(2025) 2025-05-07T20:32:32.8198394Z 2025-05-07T20:32:32.8198555Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8198619Z 2025-05-07T20:32:32.8198706Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8198823Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8198901Z x = x_sign * x_clamp 2025-05-07T20:32:32.8198972Z x0 = x[:, :D] 2025-05-07T20:32:32.8199041Z x1 = x[:, D:] 2025-05-07T20:32:32.8199102Z 2025-05-07T20:32:32.8199177Z if contiguous: 2025-05-07T20:32:32.8199258Z x0 = x0.contiguous() 2025-05-07T20:32:32.8199344Z x1 = x1.contiguous() 2025-05-07T20:32:32.8199406Z 2025-05-07T20:32:32.8199485Z if scale_ub is not None: 2025-05-07T20:32:32.8199588Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8199716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8199788Z ) 2025-05-07T20:32:32.8199854Z else: 2025-05-07T20:32:32.8199940Z scale_ub_tensor = None 2025-05-07T20:32:32.8200003Z 2025-05-07T20:32:32.8200126Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8200205Z op = silu_mul_quant 2025-05-07T20:32:32.8200278Z if compiled: 2025-05-07T20:32:32.8200370Z op = torch.compile(op) 2025-05-07T20:32:32.8200469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8200537Z 2025-05-07T20:32:32.8200620Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8200624Z 2025-05-07T20:32:32.8200791Z moe/activation_test.py:117: 2025-05-07T20:32:32.8200917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8201010Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8201100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8201465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8201557Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8202050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8202137Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8202485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8202704Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8203040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8203127Z kernel = self.compile( 2025-05-07T20:32:32.8203502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8203670Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8203797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8203802Z 2025-05-07T20:32:32.8203998Z self = 2025-05-07T20:32:32.8204845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8205353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313a2a0>} 2025-05-07T20:32:32.8206089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8206354Z context = 2025-05-07T20:32:32.8206359Z 2025-05-07T20:32:32.8206517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8206779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8206879Z module_map=module_map) 2025-05-07T20:32:32.8207030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8207121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8207187Z E ^ 2025-05-07T20:32:32.8207540Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8207544Z 2025-05-07T20:32:32.8207950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8207954Z 2025-05-07T20:32:32.8208046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8208502Z self=, 2025-05-07T20:32:32.8208606Z T=2048, 2025-05-07T20:32:32.8208695Z D=5120, 2025-05-07T20:32:32.8208774Z scale_ub=None, 2025-05-07T20:32:32.8208854Z contiguous=False, 2025-05-07T20:32:32.8208926Z compiled=True, 2025-05-07T20:32:32.8208993Z ) 2025-05-07T20:32:32.8209205Z self = 2025-05-07T20:32:32.8209371Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8209375Z 2025-05-07T20:32:32.8209453Z @given( 2025-05-07T20:32:32.8209704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8209800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8209906Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8210012Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8210119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8210190Z ) 2025-05-07T20:32:32.8210427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8210512Z def test_silu_mul_quant( 2025-05-07T20:32:32.8210580Z self, 2025-05-07T20:32:32.8210651Z T: int, 2025-05-07T20:32:32.8210720Z D: int, 2025-05-07T20:32:32.8210808Z scale_ub: Optional[float], 2025-05-07T20:32:32.8210889Z contiguous: bool, 2025-05-07T20:32:32.8210968Z compiled: bool, 2025-05-07T20:32:32.8211035Z ) -> None: 2025-05-07T20:32:32.8211129Z torch.manual_seed(2025) 2025-05-07T20:32:32.8211195Z 2025-05-07T20:32:32.8211362Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8211427Z 2025-05-07T20:32:32.8211514Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8211636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8211722Z x = x_sign * x_clamp 2025-05-07T20:32:32.8211798Z x0 = x[:, :D] 2025-05-07T20:32:32.8211868Z x1 = x[:, D:] 2025-05-07T20:32:32.8211933Z 2025-05-07T20:32:32.8212006Z if contiguous: 2025-05-07T20:32:32.8212093Z x0 = x0.contiguous() 2025-05-07T20:32:32.8212178Z x1 = x1.contiguous() 2025-05-07T20:32:32.8212244Z 2025-05-07T20:32:32.8212329Z if scale_ub is not None: 2025-05-07T20:32:32.8212425Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8212551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8212627Z ) 2025-05-07T20:32:32.8212691Z else: 2025-05-07T20:32:32.8212774Z scale_ub_tensor = None 2025-05-07T20:32:32.8212847Z 2025-05-07T20:32:32.8212970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8213051Z op = silu_mul_quant 2025-05-07T20:32:32.8213127Z if compiled: 2025-05-07T20:32:32.8213220Z op = torch.compile(op) 2025-05-07T20:32:32.8213464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8213529Z 2025-05-07T20:32:32.8213613Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8213617Z 2025-05-07T20:32:32.8213708Z moe/activation_test.py:117: 2025-05-07T20:32:32.8213835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8213929Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8214022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8214382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8214476Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8214972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8215060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8215412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8215636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8215966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8216057Z kernel = self.compile( 2025-05-07T20:32:32.8216441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8216608Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8216735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8216820Z 2025-05-07T20:32:32.8217022Z self = 2025-05-07T20:32:32.8217806Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8218313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad313b560>} 2025-05-07T20:32:32.8219061Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8219246Z context = 2025-05-07T20:32:32.8219250Z 2025-05-07T20:32:32.8219416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8219673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8219773Z module_map=module_map) 2025-05-07T20:32:32.8219932Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8220031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8220098Z E ^ 2025-05-07T20:32:32.8220448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8220453Z 2025-05-07T20:32:32.8220855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8220859Z 2025-05-07T20:32:32.8220952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8221176Z self=, 2025-05-07T20:32:32.8221247Z T=2048, 2025-05-07T20:32:32.8221325Z D=5120, 2025-05-07T20:32:32.8221399Z scale_ub=1200.0, 2025-05-07T20:32:32.8221483Z contiguous=False, 2025-05-07T20:32:32.8221560Z compiled=True, 2025-05-07T20:32:32.8225494Z ) 2025-05-07T20:32:32.8225730Z self = 2025-05-07T20:32:32.8226016Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8226022Z 2025-05-07T20:32:32.8226099Z @given( 2025-05-07T20:32:32.8226218Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8226312Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8226437Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8226559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8226668Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8226737Z ) 2025-05-07T20:32:32.8226978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8227077Z def test_silu_mul_quant( 2025-05-07T20:32:32.8227152Z self, 2025-05-07T20:32:32.8227224Z T: int, 2025-05-07T20:32:32.8227297Z D: int, 2025-05-07T20:32:32.8227388Z scale_ub: Optional[float], 2025-05-07T20:32:32.8227486Z contiguous: bool, 2025-05-07T20:32:32.8227582Z compiled: bool, 2025-05-07T20:32:32.8227661Z ) -> None: 2025-05-07T20:32:32.8227751Z torch.manual_seed(2025) 2025-05-07T20:32:32.8227822Z 2025-05-07T20:32:32.8227989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8228069Z 2025-05-07T20:32:32.8228156Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8228275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8228363Z x = x_sign * x_clamp 2025-05-07T20:32:32.8228438Z x0 = x[:, :D] 2025-05-07T20:32:32.8228514Z x1 = x[:, D:] 2025-05-07T20:32:32.8228587Z 2025-05-07T20:32:32.8228753Z if contiguous: 2025-05-07T20:32:32.8228841Z x0 = x0.contiguous() 2025-05-07T20:32:32.8228929Z x1 = x1.contiguous() 2025-05-07T20:32:32.8229000Z 2025-05-07T20:32:32.8229085Z if scale_ub is not None: 2025-05-07T20:32:32.8229187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8229328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8229404Z ) 2025-05-07T20:32:32.8229475Z else: 2025-05-07T20:32:32.8229562Z scale_ub_tensor = None 2025-05-07T20:32:32.8229638Z 2025-05-07T20:32:32.8229761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8229844Z op = silu_mul_quant 2025-05-07T20:32:32.8229934Z if compiled: 2025-05-07T20:32:32.8230035Z op = torch.compile(op) 2025-05-07T20:32:32.8230137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8230205Z 2025-05-07T20:32:32.8230292Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8230305Z 2025-05-07T20:32:32.8230398Z moe/activation_test.py:117: 2025-05-07T20:32:32.8230531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8230636Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8230732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8231106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8231193Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8231689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8231781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8232131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8232352Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8232693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8232786Z kernel = self.compile( 2025-05-07T20:32:32.8233163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8233423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8233549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8233554Z 2025-05-07T20:32:32.8233753Z self = 2025-05-07T20:32:32.8234530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8235042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f0c20>} 2025-05-07T20:32:32.8235789Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8235990Z context = 2025-05-07T20:32:32.8235994Z 2025-05-07T20:32:32.8236160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8236436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8236543Z module_map=module_map) 2025-05-07T20:32:32.8236706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8236804Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8236877Z E ^ 2025-05-07T20:32:32.8237308Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8237313Z 2025-05-07T20:32:32.8237721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8237732Z 2025-05-07T20:32:32.8237830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8238049Z self=, 2025-05-07T20:32:32.8238122Z T=4096, 2025-05-07T20:32:32.8238192Z D=5120, 2025-05-07T20:32:32.8238282Z scale_ub=1200.0, 2025-05-07T20:32:32.8238371Z contiguous=True, 2025-05-07T20:32:32.8238455Z compiled=True, 2025-05-07T20:32:32.8238522Z ) 2025-05-07T20:32:32.8238743Z self = 2025-05-07T20:32:32.8238920Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8238924Z 2025-05-07T20:32:32.8239002Z @given( 2025-05-07T20:32:32.8239115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8239213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8239321Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8239433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8239550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8239619Z ) 2025-05-07T20:32:32.8239861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8239949Z def test_silu_mul_quant( 2025-05-07T20:32:32.8240022Z self, 2025-05-07T20:32:32.8240101Z T: int, 2025-05-07T20:32:32.8240180Z D: int, 2025-05-07T20:32:32.8240281Z scale_ub: Optional[float], 2025-05-07T20:32:32.8240371Z contiguous: bool, 2025-05-07T20:32:32.8240452Z compiled: bool, 2025-05-07T20:32:32.8240525Z ) -> None: 2025-05-07T20:32:32.8240626Z torch.manual_seed(2025) 2025-05-07T20:32:32.8240701Z 2025-05-07T20:32:32.8240869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8240940Z 2025-05-07T20:32:32.8241025Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8241153Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8241324Z x = x_sign * x_clamp 2025-05-07T20:32:32.8241399Z x0 = x[:, :D] 2025-05-07T20:32:32.8241482Z x1 = x[:, D:] 2025-05-07T20:32:32.8241549Z 2025-05-07T20:32:32.8241628Z if contiguous: 2025-05-07T20:32:32.8241716Z x0 = x0.contiguous() 2025-05-07T20:32:32.8241810Z x1 = x1.contiguous() 2025-05-07T20:32:32.8241879Z 2025-05-07T20:32:32.8241968Z if scale_ub is not None: 2025-05-07T20:32:32.8242066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8242198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8242273Z ) 2025-05-07T20:32:32.8242341Z else: 2025-05-07T20:32:32.8242437Z scale_ub_tensor = None 2025-05-07T20:32:32.8242503Z 2025-05-07T20:32:32.8242634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8242727Z op = silu_mul_quant 2025-05-07T20:32:32.8242813Z if compiled: 2025-05-07T20:32:32.8242920Z op = torch.compile(op) 2025-05-07T20:32:32.8243026Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8243094Z 2025-05-07T20:32:32.8243183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8243188Z 2025-05-07T20:32:32.8243282Z moe/activation_test.py:117: 2025-05-07T20:32:32.8243410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8243521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8243623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8243986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8244289Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8244784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8244887Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8245244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8245464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8245801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8245891Z kernel = self.compile( 2025-05-07T20:32:32.8246274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8246445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8246574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8246579Z 2025-05-07T20:32:32.8246782Z self = 2025-05-07T20:32:32.8247555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8248070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f1a80>} 2025-05-07T20:32:32.8248816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8249004Z context = 2025-05-07T20:32:32.8249014Z 2025-05-07T20:32:32.8249180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8249440Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8249543Z module_map=module_map) 2025-05-07T20:32:32.8249808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8249898Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8249971Z E ^ 2025-05-07T20:32:32.8250322Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8250327Z 2025-05-07T20:32:32.8250738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8250743Z 2025-05-07T20:32:32.8250835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8251059Z self=, 2025-05-07T20:32:32.8251128Z T=128, 2025-05-07T20:32:32.8251195Z D=5120, 2025-05-07T20:32:32.8251274Z scale_ub=1200.0, 2025-05-07T20:32:32.8251364Z contiguous=False, 2025-05-07T20:32:32.8251437Z compiled=True, 2025-05-07T20:32:32.8251499Z ) 2025-05-07T20:32:32.8251721Z self = 2025-05-07T20:32:32.8251883Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8251888Z 2025-05-07T20:32:32.8251958Z @given( 2025-05-07T20:32:32.8252069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8252159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8252267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8252376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8252479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8252546Z ) 2025-05-07T20:32:32.8252863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8252948Z def test_silu_mul_quant( 2025-05-07T20:32:32.8253015Z self, 2025-05-07T20:32:32.8253083Z T: int, 2025-05-07T20:32:32.8253155Z D: int, 2025-05-07T20:32:32.8253251Z scale_ub: Optional[float], 2025-05-07T20:32:32.8253332Z contiguous: bool, 2025-05-07T20:32:32.8253410Z compiled: bool, 2025-05-07T20:32:32.8253478Z ) -> None: 2025-05-07T20:32:32.8253562Z torch.manual_seed(2025) 2025-05-07T20:32:32.8253627Z 2025-05-07T20:32:32.8253791Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8253855Z 2025-05-07T20:32:32.8253943Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8254059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8254140Z x = x_sign * x_clamp 2025-05-07T20:32:32.8254213Z x0 = x[:, :D] 2025-05-07T20:32:32.8254282Z x1 = x[:, D:] 2025-05-07T20:32:32.8254352Z 2025-05-07T20:32:32.8254428Z if contiguous: 2025-05-07T20:32:32.8254512Z x0 = x0.contiguous() 2025-05-07T20:32:32.8254594Z x1 = x1.contiguous() 2025-05-07T20:32:32.8254657Z 2025-05-07T20:32:32.8254739Z if scale_ub is not None: 2025-05-07T20:32:32.8254841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8254971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8255036Z ) 2025-05-07T20:32:32.8255105Z else: 2025-05-07T20:32:32.8255189Z scale_ub_tensor = None 2025-05-07T20:32:32.8255252Z 2025-05-07T20:32:32.8255382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8255463Z op = silu_mul_quant 2025-05-07T20:32:32.8255538Z if compiled: 2025-05-07T20:32:32.8255630Z op = torch.compile(op) 2025-05-07T20:32:32.8255726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8255793Z 2025-05-07T20:32:32.8255880Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8255885Z 2025-05-07T20:32:32.8255971Z moe/activation_test.py:117: 2025-05-07T20:32:32.8256094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8256271Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8256360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8256720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8256803Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8257291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8257378Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8257724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8257949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8258279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8258364Z kernel = self.compile( 2025-05-07T20:32:32.8258743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8258916Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8259041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8259045Z 2025-05-07T20:32:32.8259246Z self = 2025-05-07T20:32:32.8260015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8260601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad29f2ca0>} 2025-05-07T20:32:32.8261342Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8261535Z context = 2025-05-07T20:32:32.8261539Z 2025-05-07T20:32:32.8261695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8261953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8262054Z module_map=module_map) 2025-05-07T20:32:32.8262206Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8262299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8262373Z E ^ 2025-05-07T20:32:32.8262717Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8262722Z 2025-05-07T20:32:32.8263129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8263138Z 2025-05-07T20:32:32.8263232Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8263449Z self=, 2025-05-07T20:32:32.8263516Z T=16384, 2025-05-07T20:32:32.8263581Z D=7168, 2025-05-07T20:32:32.8263659Z scale_ub=1200.0, 2025-05-07T20:32:32.8263737Z contiguous=True, 2025-05-07T20:32:32.8263808Z compiled=True, 2025-05-07T20:32:32.8263873Z ) 2025-05-07T20:32:32.8264082Z self = 2025-05-07T20:32:32.8264255Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8264259Z 2025-05-07T20:32:32.8264329Z @given( 2025-05-07T20:32:32.8264439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8264532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8264639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8264827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8264936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8264999Z ) 2025-05-07T20:32:32.8265237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8265323Z def test_silu_mul_quant( 2025-05-07T20:32:32.8265391Z self, 2025-05-07T20:32:32.8265457Z T: int, 2025-05-07T20:32:32.8265524Z D: int, 2025-05-07T20:32:32.8265611Z scale_ub: Optional[float], 2025-05-07T20:32:32.8265696Z contiguous: bool, 2025-05-07T20:32:32.8265773Z compiled: bool, 2025-05-07T20:32:32.8265848Z ) -> None: 2025-05-07T20:32:32.8265934Z torch.manual_seed(2025) 2025-05-07T20:32:32.8265997Z 2025-05-07T20:32:32.8266159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8266226Z 2025-05-07T20:32:32.8266308Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8266431Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8266515Z x = x_sign * x_clamp 2025-05-07T20:32:32.8266585Z x0 = x[:, :D] 2025-05-07T20:32:32.8266653Z x1 = x[:, D:] 2025-05-07T20:32:32.8266719Z 2025-05-07T20:32:32.8266792Z if contiguous: 2025-05-07T20:32:32.8266873Z x0 = x0.contiguous() 2025-05-07T20:32:32.8266955Z x1 = x1.contiguous() 2025-05-07T20:32:32.8267016Z 2025-05-07T20:32:32.8267103Z if scale_ub is not None: 2025-05-07T20:32:32.8267201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8267328Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8267480Z ) 2025-05-07T20:32:32.8267547Z else: 2025-05-07T20:32:32.8267632Z scale_ub_tensor = None 2025-05-07T20:32:32.8267697Z 2025-05-07T20:32:32.8267816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8267898Z op = silu_mul_quant 2025-05-07T20:32:32.8267984Z if compiled: 2025-05-07T20:32:32.8268074Z op = torch.compile(op) 2025-05-07T20:32:32.8268171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8268238Z 2025-05-07T20:32:32.8268317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8268322Z 2025-05-07T20:32:32.8268414Z moe/activation_test.py:117: 2025-05-07T20:32:32.8268536Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8268629Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8268723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8269086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8269171Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8269658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8269751Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8270104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8270319Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8270651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8270738Z kernel = self.compile( 2025-05-07T20:32:32.8271111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8271287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8271410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8271414Z 2025-05-07T20:32:32.8271614Z self = 2025-05-07T20:32:32.8272385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8272968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2540400>} 2025-05-07T20:32:32.8273707Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8273901Z context = 2025-05-07T20:32:32.8273905Z 2025-05-07T20:32:32.8274063Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8274326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8274429Z module_map=module_map) 2025-05-07T20:32:32.8274589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8274677Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8274744Z E ^ 2025-05-07T20:32:32.8275093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8275098Z 2025-05-07T20:32:32.8275504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8275508Z 2025-05-07T20:32:32.8275601Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8275919Z self=, 2025-05-07T20:32:32.8275987Z T=16384, 2025-05-07T20:32:32.8276060Z D=5120, 2025-05-07T20:32:32.8276134Z scale_ub=1200.0, 2025-05-07T20:32:32.8276211Z contiguous=True, 2025-05-07T20:32:32.8276289Z compiled=False, 2025-05-07T20:32:32.8276351Z ) 2025-05-07T20:32:32.8276560Z self = 2025-05-07T20:32:32.8276736Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8276740Z 2025-05-07T20:32:32.8276807Z @given( 2025-05-07T20:32:32.8276917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8277012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8277118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8277227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8277332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8277401Z ) 2025-05-07T20:32:32.8277642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8277726Z def test_silu_mul_quant( 2025-05-07T20:32:32.8277793Z self, 2025-05-07T20:32:32.8277863Z T: int, 2025-05-07T20:32:32.8277932Z D: int, 2025-05-07T20:32:32.8278020Z scale_ub: Optional[float], 2025-05-07T20:32:32.8278102Z contiguous: bool, 2025-05-07T20:32:32.8278180Z compiled: bool, 2025-05-07T20:32:32.8278246Z ) -> None: 2025-05-07T20:32:32.8278337Z torch.manual_seed(2025) 2025-05-07T20:32:32.8278400Z 2025-05-07T20:32:32.8278565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8278628Z 2025-05-07T20:32:32.8278710Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8278830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8278910Z x = x_sign * x_clamp 2025-05-07T20:32:32.8278984Z x0 = x[:, :D] 2025-05-07T20:32:32.8279059Z x1 = x[:, D:] 2025-05-07T20:32:32.8279120Z 2025-05-07T20:32:32.8279194Z if contiguous: 2025-05-07T20:32:32.8279278Z x0 = x0.contiguous() 2025-05-07T20:32:32.8279356Z x1 = x1.contiguous() 2025-05-07T20:32:32.8279502Z 2025-05-07T20:32:32.8279587Z if scale_ub is not None: 2025-05-07T20:32:32.8279681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8279810Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8279876Z ) 2025-05-07T20:32:32.8279942Z else: 2025-05-07T20:32:32.8280027Z scale_ub_tensor = None 2025-05-07T20:32:32.8280088Z 2025-05-07T20:32:32.8280208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8280292Z op = silu_mul_quant 2025-05-07T20:32:32.8280367Z if compiled: 2025-05-07T20:32:32.8280458Z op = torch.compile(op) 2025-05-07T20:32:32.8280561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8280623Z 2025-05-07T20:32:32.8280703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8280714Z 2025-05-07T20:32:32.8280802Z moe/activation_test.py:117: 2025-05-07T20:32:32.8280923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8281024Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8281114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8281606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:32.8281702Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8282050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8282266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8282679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8282764Z kernel = self.compile( 2025-05-07T20:32:32.8283142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8283313Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8283434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8283439Z 2025-05-07T20:32:32.8283639Z self = 2025-05-07T20:32:32.8284514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8285025Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2540e00>} 2025-05-07T20:32:32.8285760Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8285958Z context = 2025-05-07T20:32:32.8285962Z 2025-05-07T20:32:32.8286118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8286373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8286476Z module_map=module_map) 2025-05-07T20:32:32.8286631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8286723Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8286796Z E ^ 2025-05-07T20:32:32.8287150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8287155Z 2025-05-07T20:32:32.8287565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8287569Z 2025-05-07T20:32:32.8287748Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8287968Z self=, 2025-05-07T20:32:32.8288049Z T=1, 2025-05-07T20:32:32.8288122Z D=7168, 2025-05-07T20:32:32.8288200Z scale_ub=1200.0, 2025-05-07T20:32:32.8288285Z contiguous=False, 2025-05-07T20:32:32.8288371Z compiled=False, 2025-05-07T20:32:32.8288440Z ) 2025-05-07T20:32:32.8288655Z self = 2025-05-07T20:32:32.8288828Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8288832Z 2025-05-07T20:32:32.8288914Z @given( 2025-05-07T20:32:32.8289034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8289127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8289240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8289351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8289468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8289542Z ) 2025-05-07T20:32:32.8289783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8289874Z def test_silu_mul_quant( 2025-05-07T20:32:32.8289947Z self, 2025-05-07T20:32:32.8290021Z T: int, 2025-05-07T20:32:32.8290093Z D: int, 2025-05-07T20:32:32.8290187Z scale_ub: Optional[float], 2025-05-07T20:32:32.8290273Z contiguous: bool, 2025-05-07T20:32:32.8290357Z compiled: bool, 2025-05-07T20:32:32.8290431Z ) -> None: 2025-05-07T20:32:32.8290519Z torch.manual_seed(2025) 2025-05-07T20:32:32.8290590Z 2025-05-07T20:32:32.8290832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8290903Z 2025-05-07T20:32:32.8290992Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8291111Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8291199Z x = x_sign * x_clamp 2025-05-07T20:32:32.8291280Z x0 = x[:, :D] 2025-05-07T20:32:32.8291356Z x1 = x[:, D:] 2025-05-07T20:32:32.8291428Z 2025-05-07T20:32:32.8291508Z if contiguous: 2025-05-07T20:32:32.8291594Z x0 = x0.contiguous() 2025-05-07T20:32:32.8291679Z x1 = x1.contiguous() 2025-05-07T20:32:32.8291751Z 2025-05-07T20:32:32.8291837Z if scale_ub is not None: 2025-05-07T20:32:32.8291943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8292073Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8292146Z ) 2025-05-07T20:32:32.8292221Z else: 2025-05-07T20:32:32.8292314Z scale_ub_tensor = None 2025-05-07T20:32:32.8292380Z 2025-05-07T20:32:32.8292510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8292593Z op = silu_mul_quant 2025-05-07T20:32:32.8292679Z if compiled: 2025-05-07T20:32:32.8292777Z op = torch.compile(op) 2025-05-07T20:32:32.8292882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8292952Z 2025-05-07T20:32:32.8293037Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8293041Z 2025-05-07T20:32:32.8293136Z moe/activation_test.py:117: 2025-05-07T20:32:32.8293267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8293363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8293458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8293951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8294051Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8294410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8294629Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8295052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8295146Z kernel = self.compile( 2025-05-07T20:32:32.8295523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8295694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8295819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8295824Z 2025-05-07T20:32:32.8296027Z self = 2025-05-07T20:32:32.8296815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8297315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2542160>} 2025-05-07T20:32:32.8298067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8298256Z context = 2025-05-07T20:32:32.8298261Z 2025-05-07T20:32:32.8298426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8298686Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8298863Z module_map=module_map) 2025-05-07T20:32:32.8299028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8299122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8299194Z E ^ 2025-05-07T20:32:32.8299548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8299555Z 2025-05-07T20:32:32.8299964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8299968Z 2025-05-07T20:32:32.8300070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8300286Z self=, 2025-05-07T20:32:32.8300364Z T=4096, 2025-05-07T20:32:32.8300436Z D=7168, 2025-05-07T20:32:32.8300512Z scale_ub=1200.0, 2025-05-07T20:32:32.8300596Z contiguous=False, 2025-05-07T20:32:32.8300679Z compiled=True, 2025-05-07T20:32:32.8300751Z ) 2025-05-07T20:32:32.8300964Z self = 2025-05-07T20:32:32.8301137Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8301142Z 2025-05-07T20:32:32.8301221Z @given( 2025-05-07T20:32:32.8301343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8301437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8301546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8301661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8301768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8301835Z ) 2025-05-07T20:32:32.8302078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8302166Z def test_silu_mul_quant( 2025-05-07T20:32:32.8302239Z self, 2025-05-07T20:32:32.8302312Z T: int, 2025-05-07T20:32:32.8302390Z D: int, 2025-05-07T20:32:32.8302490Z scale_ub: Optional[float], 2025-05-07T20:32:32.8302576Z contiguous: bool, 2025-05-07T20:32:32.8302659Z compiled: bool, 2025-05-07T20:32:32.8302735Z ) -> None: 2025-05-07T20:32:32.8302825Z torch.manual_seed(2025) 2025-05-07T20:32:32.8302972Z 2025-05-07T20:32:32.8303139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8303208Z 2025-05-07T20:32:32.8303295Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8303422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8303510Z x = x_sign * x_clamp 2025-05-07T20:32:32.8303598Z x0 = x[:, :D] 2025-05-07T20:32:32.8303675Z x1 = x[:, D:] 2025-05-07T20:32:32.8303749Z 2025-05-07T20:32:32.8303833Z if contiguous: 2025-05-07T20:32:32.8303919Z x0 = x0.contiguous() 2025-05-07T20:32:32.8304001Z x1 = x1.contiguous() 2025-05-07T20:32:32.8304069Z 2025-05-07T20:32:32.8304165Z if scale_ub is not None: 2025-05-07T20:32:32.8304269Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8304401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8304475Z ) 2025-05-07T20:32:32.8304545Z else: 2025-05-07T20:32:32.8304649Z scale_ub_tensor = None 2025-05-07T20:32:32.8304718Z 2025-05-07T20:32:32.8304851Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8304937Z op = silu_mul_quant 2025-05-07T20:32:32.8305018Z if compiled: 2025-05-07T20:32:32.8305120Z op = torch.compile(op) 2025-05-07T20:32:32.8305225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8305297Z 2025-05-07T20:32:32.8305387Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8305392Z 2025-05-07T20:32:32.8305485Z moe/activation_test.py:117: 2025-05-07T20:32:32.8305609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8305895Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8305990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8306358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8306451Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8306937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8307033Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8307381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8307600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8307937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8308029Z kernel = self.compile( 2025-05-07T20:32:32.8310052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8310291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8310415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8310429Z 2025-05-07T20:32:32.8310634Z self = 2025-05-07T20:32:32.8311414Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8311919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2543420>} 2025-05-07T20:32:32.8312660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8312847Z context = 2025-05-07T20:32:32.8313147Z 2025-05-07T20:32:32.8313307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8313565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8313666Z module_map=module_map) 2025-05-07T20:32:32.8313817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8313906Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8313976Z E ^ 2025-05-07T20:32:32.8314321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8314325Z 2025-05-07T20:32:32.8314734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8314739Z 2025-05-07T20:32:32.8314831Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8315045Z self=, 2025-05-07T20:32:32.8315122Z T=128, 2025-05-07T20:32:32.8315190Z D=7168, 2025-05-07T20:32:32.8315261Z scale_ub=1200.0, 2025-05-07T20:32:32.8315339Z contiguous=False, 2025-05-07T20:32:32.8315412Z compiled=True, 2025-05-07T20:32:32.8315474Z ) 2025-05-07T20:32:32.8315688Z self = 2025-05-07T20:32:32.8315849Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:32.8315853Z 2025-05-07T20:32:32.8315922Z @given( 2025-05-07T20:32:32.8316034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8316122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8316348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8316459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8316562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8316629Z ) 2025-05-07T20:32:32.8316864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8316954Z def test_silu_mul_quant( 2025-05-07T20:32:32.8317020Z self, 2025-05-07T20:32:32.8317084Z T: int, 2025-05-07T20:32:32.8317149Z D: int, 2025-05-07T20:32:32.8317236Z scale_ub: Optional[float], 2025-05-07T20:32:32.8317313Z contiguous: bool, 2025-05-07T20:32:32.8317396Z compiled: bool, 2025-05-07T20:32:32.8317464Z ) -> None: 2025-05-07T20:32:32.8317546Z torch.manual_seed(2025) 2025-05-07T20:32:32.8317610Z 2025-05-07T20:32:32.8317772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8317834Z 2025-05-07T20:32:32.8317925Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8318040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8318117Z x = x_sign * x_clamp 2025-05-07T20:32:32.8318190Z x0 = x[:, :D] 2025-05-07T20:32:32.8318260Z x1 = x[:, D:] 2025-05-07T20:32:32.8318330Z 2025-05-07T20:32:32.8318403Z if contiguous: 2025-05-07T20:32:32.8318485Z x0 = x0.contiguous() 2025-05-07T20:32:32.8318565Z x1 = x1.contiguous() 2025-05-07T20:32:32.8318628Z 2025-05-07T20:32:32.8318708Z if scale_ub is not None: 2025-05-07T20:32:32.8318811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8318936Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8319000Z ) 2025-05-07T20:32:32.8319067Z else: 2025-05-07T20:32:32.8319148Z scale_ub_tensor = None 2025-05-07T20:32:32.8319209Z 2025-05-07T20:32:32.8319330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8319414Z op = silu_mul_quant 2025-05-07T20:32:32.8319490Z if compiled: 2025-05-07T20:32:32.8319580Z op = torch.compile(op) 2025-05-07T20:32:32.8319674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8319827Z 2025-05-07T20:32:32.8319909Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8319913Z 2025-05-07T20:32:32.8320001Z moe/activation_test.py:117: 2025-05-07T20:32:32.8320124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8320216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8320304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8320663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8320745Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8321235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8321320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8321668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8321888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8322226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8322309Z kernel = self.compile( 2025-05-07T20:32:32.8322681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8322850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8322975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8322980Z 2025-05-07T20:32:32.8323176Z self = 2025-05-07T20:32:32.8324023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8324650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2468720>} 2025-05-07T20:32:32.8325387Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8325574Z context = 2025-05-07T20:32:32.8325578Z 2025-05-07T20:32:32.8325732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8326000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8326098Z module_map=module_map) 2025-05-07T20:32:32.8326251Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8326342Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8326410Z E ^ 2025-05-07T20:32:32.8326754Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8326759Z 2025-05-07T20:32:32.8327168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8327172Z 2025-05-07T20:32:32.8327264Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8327482Z self=, 2025-05-07T20:32:32.8327548Z T=2048, 2025-05-07T20:32:32.8327613Z D=7168, 2025-05-07T20:32:32.8327689Z scale_ub=None, 2025-05-07T20:32:32.8327768Z contiguous=True, 2025-05-07T20:32:32.8327840Z compiled=True, 2025-05-07T20:32:32.8327904Z ) 2025-05-07T20:32:32.8328113Z self = 2025-05-07T20:32:32.8328274Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8328368Z 2025-05-07T20:32:32.8328433Z @given( 2025-05-07T20:32:32.8328542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8328635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8328739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8328846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8328958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8329020Z ) 2025-05-07T20:32:32.8329255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8329341Z def test_silu_mul_quant( 2025-05-07T20:32:32.8329406Z self, 2025-05-07T20:32:32.8329479Z T: int, 2025-05-07T20:32:32.8329550Z D: int, 2025-05-07T20:32:32.8329637Z scale_ub: Optional[float], 2025-05-07T20:32:32.8329719Z contiguous: bool, 2025-05-07T20:32:32.8329792Z compiled: bool, 2025-05-07T20:32:32.8329858Z ) -> None: 2025-05-07T20:32:32.8329954Z torch.manual_seed(2025) 2025-05-07T20:32:32.8330015Z 2025-05-07T20:32:32.8330174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8330239Z 2025-05-07T20:32:32.8330323Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8330438Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8330519Z x = x_sign * x_clamp 2025-05-07T20:32:32.8330586Z x0 = x[:, :D] 2025-05-07T20:32:32.8330654Z x1 = x[:, D:] 2025-05-07T20:32:32.8330717Z 2025-05-07T20:32:32.8330791Z if contiguous: 2025-05-07T20:32:32.8330878Z x0 = x0.contiguous() 2025-05-07T20:32:32.8331037Z x1 = x1.contiguous() 2025-05-07T20:32:32.8331102Z 2025-05-07T20:32:32.8331188Z if scale_ub is not None: 2025-05-07T20:32:32.8331283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8331410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8331483Z ) 2025-05-07T20:32:32.8331547Z else: 2025-05-07T20:32:32.8331629Z scale_ub_tensor = None 2025-05-07T20:32:32.8331694Z 2025-05-07T20:32:32.8331816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8331895Z op = silu_mul_quant 2025-05-07T20:32:32.8331973Z if compiled: 2025-05-07T20:32:32.8332064Z op = torch.compile(op) 2025-05-07T20:32:32.8332162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8332223Z 2025-05-07T20:32:32.8332303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8332308Z 2025-05-07T20:32:32.8332398Z moe/activation_test.py:117: 2025-05-07T20:32:32.8332528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8332623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8332722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8333083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8333171Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
(through torch/_dynamo/eval_frame.py:678 and the same Triton frames as above)
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
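From the test's usage, silu_mul_quant(x0, x1, scale_ub) fuses a SiLU-gated multiply with fp8 quantization and returns the quantized tensor plus its scales. A rough eager-mode sketch of that contract follows; the rowwise scale layout and the scale_ub clamping semantics are assumptions inferred from the test, not FBGEMM's implementation.

# Eager reference sketch of the fused op under test (names and rowwise
# quantization layout are assumed, not taken from FBGEMM source).
from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in fp32 for accuracy, then rowwise fp8 quantization.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the scale
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale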
Trying example: test_silu_mul_quant(
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    (same test body as above)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
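The "Tried to allocate" sizes match the test's tensor shapes exactly: each of randn, sign, clamp, and the product materializes one [T, 2*D] bf16 buffer of T * 2*D * 2 bytes. A quick check of the arithmetic:

# Each intermediate in the test is a [T, 2*D] bf16 tensor (2 bytes/element).
def alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    return T * 2 * D * bytes_per_elem / 2**20

assert alloc_mib(16384, 5120) == 320.0  # matches "320.00 MiB" above
assert alloc_mib(4096, 7168) == 112.0   # matches the 112.00 MiB failures
assert alloc_mib(16384, 7168) == 448.0  # matches the 448.00 MiB failures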
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with only 32.44 MiB free
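Note the cascade: after the first OOM, roughly 22 GiB stays allocated, so even 40-56 MiB requests in later examples fail, which suggests tensors from earlier Hypothesis examples are still alive. Two mitigations, sketched under that assumption (neither is what this workflow currently does): the allocator hint the error message itself recommends, and an explicit cache release between examples.

# 1) Allocator hint from the error message, set before the process starts:
#    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
#
# 2) Release cached blocks between examples so one failure does not
#    starve the next (hypothetical helper, called from test teardown):
import gc
import torch

def free_cuda() -> None:
    gc.collect()              # drop dead Python references first
    torch.cuda.empty_cache()  # then return cached blocks to the driver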
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8354434Z 2025-05-07T20:32:32.8354542Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8354546Z 2025-05-07T20:32:32.8354636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8354851Z self=, 2025-05-07T20:32:32.8354916Z T=16384, 2025-05-07T20:32:32.8354980Z D=7168, 2025-05-07T20:32:32.8355053Z scale_ub=None, 2025-05-07T20:32:32.8355128Z contiguous=False, 2025-05-07T20:32:32.8355204Z compiled=False, 2025-05-07T20:32:32.8355271Z ) 2025-05-07T20:32:32.8355479Z self = 2025-05-07T20:32:32.8355650Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8355740Z 2025-05-07T20:32:32.8355808Z @given( 2025-05-07T20:32:32.8355918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8356007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8356110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8356223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8356332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8356399Z ) 2025-05-07T20:32:32.8356642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8356731Z def test_silu_mul_quant( 2025-05-07T20:32:32.8356798Z self, 2025-05-07T20:32:32.8356864Z T: int, 2025-05-07T20:32:32.8356933Z D: int, 2025-05-07T20:32:32.8357019Z scale_ub: Optional[float], 2025-05-07T20:32:32.8357098Z contiguous: bool, 2025-05-07T20:32:32.8357174Z compiled: bool, 2025-05-07T20:32:32.8357241Z ) -> None: 2025-05-07T20:32:32.8357326Z torch.manual_seed(2025) 2025-05-07T20:32:32.8357397Z 2025-05-07T20:32:32.8357553Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8359422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8359429Z 2025-05-07T20:32:32.8359541Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8359546Z 2025-05-07T20:32:32.8359642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8359855Z self=, 2025-05-07T20:32:32.8359937Z T=2048, 2025-05-07T20:32:32.8360000Z D=7168, 2025-05-07T20:32:32.8360070Z scale_ub=1200.0, 2025-05-07T20:32:32.8360145Z contiguous=True, 2025-05-07T20:32:32.8360216Z compiled=True, 2025-05-07T20:32:32.8360280Z ) 2025-05-07T20:32:32.8360488Z self = 2025-05-07T20:32:32.8360649Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8360654Z 2025-05-07T20:32:32.8360718Z @given( 2025-05-07T20:32:32.8360829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8360926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8361036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8361145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8361252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8361323Z ) 2025-05-07T20:32:32.8361567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8361649Z def test_silu_mul_quant( 2025-05-07T20:32:32.8361720Z self, 2025-05-07T20:32:32.8361785Z T: int, 2025-05-07T20:32:32.8361851Z D: int, 2025-05-07T20:32:32.8361940Z scale_ub: Optional[float], 2025-05-07T20:32:32.8362018Z contiguous: bool, 2025-05-07T20:32:32.8362094Z compiled: bool, 2025-05-07T20:32:32.8362166Z ) -> None: 2025-05-07T20:32:32.8362249Z torch.manual_seed(2025) 2025-05-07T20:32:32.8362320Z 2025-05-07T20:32:32.8362477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8362546Z 2025-05-07T20:32:32.8362632Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8362749Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8364584Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8364678Z 2025-05-07T20:32:32.8364789Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8364793Z 2025-05-07T20:32:32.8364884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8365107Z self=, 2025-05-07T20:32:32.8365172Z T=2048, 2025-05-07T20:32:32.8365236Z D=7168, 2025-05-07T20:32:32.8365308Z scale_ub=None, 2025-05-07T20:32:32.8365382Z contiguous=True, 2025-05-07T20:32:32.8365461Z compiled=False, 2025-05-07T20:32:32.8365528Z ) 2025-05-07T20:32:32.8365737Z self = 2025-05-07T20:32:32.8365907Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8365911Z 2025-05-07T20:32:32.8365976Z @given( 2025-05-07T20:32:32.8366083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8366174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8366280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8366393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8366498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8366563Z ) 2025-05-07T20:32:32.8366878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8366962Z def test_silu_mul_quant( 2025-05-07T20:32:32.8367030Z self, 2025-05-07T20:32:32.8367099Z T: int, 2025-05-07T20:32:32.8367166Z D: int, 2025-05-07T20:32:32.8367261Z scale_ub: Optional[float], 2025-05-07T20:32:32.8367340Z contiguous: bool, 2025-05-07T20:32:32.8367414Z compiled: bool, 2025-05-07T20:32:32.8367484Z ) -> None: 2025-05-07T20:32:32.8367567Z torch.manual_seed(2025) 2025-05-07T20:32:32.8367628Z 2025-05-07T20:32:32.8367790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8367853Z 2025-05-07T20:32:32.8367934Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8369700Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8369712Z 2025-05-07T20:32:32.8369819Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8369823Z 2025-05-07T20:32:32.8369918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8370130Z self=, 2025-05-07T20:32:32.8370198Z T=1, 2025-05-07T20:32:32.8370263Z D=7168, 2025-05-07T20:32:32.8370336Z scale_ub=1200.0, 2025-05-07T20:32:32.8370414Z contiguous=True, 2025-05-07T20:32:32.8370486Z compiled=False, 2025-05-07T20:32:32.8370547Z ) 2025-05-07T20:32:32.8370766Z self = 2025-05-07T20:32:32.8370924Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8370929Z 2025-05-07T20:32:32.8370993Z @given( 2025-05-07T20:32:32.8371102Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8371298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8371414Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8371522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8371623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8371691Z ) 2025-05-07T20:32:32.8371926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8372007Z def test_silu_mul_quant( 2025-05-07T20:32:32.8372079Z self, 2025-05-07T20:32:32.8372144Z T: int, 2025-05-07T20:32:32.8372209Z D: int, 2025-05-07T20:32:32.8372299Z scale_ub: Optional[float], 2025-05-07T20:32:32.8372384Z contiguous: bool, 2025-05-07T20:32:32.8372460Z compiled: bool, 2025-05-07T20:32:32.8372528Z ) -> None: 2025-05-07T20:32:32.8372611Z torch.manual_seed(2025) 2025-05-07T20:32:32.8372675Z 2025-05-07T20:32:32.8372833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8372900Z 2025-05-07T20:32:32.8372986Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8373104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8373182Z x = x_sign * x_clamp 2025-05-07T20:32:32.8373254Z x0 = x[:, :D] 2025-05-07T20:32:32.8373326Z x1 = x[:, D:] 2025-05-07T20:32:32.8373387Z 2025-05-07T20:32:32.8373466Z if contiguous: 2025-05-07T20:32:32.8373549Z x0 = x0.contiguous() 2025-05-07T20:32:32.8373628Z x1 = x1.contiguous() 2025-05-07T20:32:32.8373693Z 2025-05-07T20:32:32.8373774Z if scale_ub is not None: 2025-05-07T20:32:32.8373951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8374085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8374151Z ) 2025-05-07T20:32:32.8374219Z else: 2025-05-07T20:32:32.8374303Z scale_ub_tensor = None 2025-05-07T20:32:32.8374370Z 2025-05-07T20:32:32.8374492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8374572Z op = silu_mul_quant 2025-05-07T20:32:32.8374646Z if compiled: 2025-05-07T20:32:32.8374741Z op = torch.compile(op) 2025-05-07T20:32:32.8374837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8374903Z 2025-05-07T20:32:32.8374986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8374990Z 2025-05-07T20:32:32.8375078Z moe/activation_test.py:117: 2025-05-07T20:32:32.8375209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8375302Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8375396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8375892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8375978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8376333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8376553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8376888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8376977Z kernel = self.compile( 2025-05-07T20:32:32.8377353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8377521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8377652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8377656Z 2025-05-07T20:32:32.8377854Z self = 2025-05-07T20:32:32.8378629Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8379215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afc400>} 2025-05-07T20:32:32.8379952Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8380138Z context = 2025-05-07T20:32:32.8380143Z 2025-05-07T20:32:32.8380305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8380568Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8380668Z module_map=module_map) 2025-05-07T20:32:32.8380827Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8380919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8380984Z E ^ 2025-05-07T20:32:32.8381332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8381340Z 2025-05-07T20:32:32.8381743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8381748Z 2025-05-07T20:32:32.8381841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8382138Z self=, 2025-05-07T20:32:32.8382206Z T=128, 2025-05-07T20:32:32.8382271Z D=5120, 2025-05-07T20:32:32.8382345Z scale_ub=None, 2025-05-07T20:32:32.8382420Z contiguous=True, 2025-05-07T20:32:32.8382494Z compiled=False, 2025-05-07T20:32:32.8382560Z ) 2025-05-07T20:32:32.8382775Z self = 2025-05-07T20:32:32.8382942Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8382946Z 2025-05-07T20:32:32.8383011Z @given( 2025-05-07T20:32:32.8383122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8383212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8383316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8383422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8383535Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8383597Z ) 2025-05-07T20:32:32.8383842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8383925Z def test_silu_mul_quant( 2025-05-07T20:32:32.8383989Z self, 2025-05-07T20:32:32.8384056Z T: int, 2025-05-07T20:32:32.8384121Z D: int, 2025-05-07T20:32:32.8384207Z scale_ub: Optional[float], 2025-05-07T20:32:32.8384298Z contiguous: bool, 2025-05-07T20:32:32.8384374Z compiled: bool, 2025-05-07T20:32:32.8384440Z ) -> None: 2025-05-07T20:32:32.8384526Z torch.manual_seed(2025) 2025-05-07T20:32:32.8384588Z 2025-05-07T20:32:32.8384746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8384813Z 2025-05-07T20:32:32.8384895Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8385009Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8385090Z x = x_sign * x_clamp 2025-05-07T20:32:32.8385160Z x0 = x[:, :D] 2025-05-07T20:32:32.8385234Z x1 = x[:, D:] 2025-05-07T20:32:32.8385300Z 2025-05-07T20:32:32.8385372Z if contiguous: 2025-05-07T20:32:32.8385456Z x0 = x0.contiguous() 2025-05-07T20:32:32.8385535Z x1 = x1.contiguous() 2025-05-07T20:32:32.8385595Z 2025-05-07T20:32:32.8385677Z if scale_ub is not None: 2025-05-07T20:32:32.8385858Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8385984Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8386054Z ) 2025-05-07T20:32:32.8386119Z else: 2025-05-07T20:32:32.8386202Z scale_ub_tensor = None 2025-05-07T20:32:32.8386266Z 2025-05-07T20:32:32.8386385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8386466Z op = silu_mul_quant 2025-05-07T20:32:32.8386540Z if compiled: 2025-05-07T20:32:32.8386635Z op = torch.compile(op) 2025-05-07T20:32:32.8386742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8386803Z 2025-05-07T20:32:32.8386887Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8386891Z 2025-05-07T20:32:32.8386984Z moe/activation_test.py:117: 2025-05-07T20:32:32.8387105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8387196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8387292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8387784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8387873Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8388223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8388439Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8388772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8388935Z kernel = self.compile( 2025-05-07T20:32:32.8389352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8389530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8389655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8389660Z 2025-05-07T20:32:32.8389862Z self = 2025-05-07T20:32:32.8390633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8391138Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afd300>} 2025-05-07T20:32:32.8391884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8392071Z context = 2025-05-07T20:32:32.8392079Z 2025-05-07T20:32:32.8392241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8392494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8392596Z module_map=module_map) 2025-05-07T20:32:32.8392750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8392842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8392913Z E ^ 2025-05-07T20:32:32.8393258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8393263Z 2025-05-07T20:32:32.8393674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8393682Z 2025-05-07T20:32:32.8393776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8393991Z self=, 2025-05-07T20:32:32.8394144Z T=128, 2025-05-07T20:32:32.8394210Z D=7168, 2025-05-07T20:32:32.8394283Z scale_ub=None, 2025-05-07T20:32:32.8394363Z contiguous=True, 2025-05-07T20:32:32.8394434Z compiled=False, 2025-05-07T20:32:32.8394495Z ) 2025-05-07T20:32:32.8394712Z self = 2025-05-07T20:32:32.8394872Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8394877Z 2025-05-07T20:32:32.8394946Z @given( 2025-05-07T20:32:32.8395055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8395149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8395256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8395361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8395463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8395539Z ) 2025-05-07T20:32:32.8395776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8395858Z def test_silu_mul_quant( 2025-05-07T20:32:32.8395926Z self, 2025-05-07T20:32:32.8395990Z T: int, 2025-05-07T20:32:32.8396055Z D: int, 2025-05-07T20:32:32.8396146Z scale_ub: Optional[float], 2025-05-07T20:32:32.8396228Z contiguous: bool, 2025-05-07T20:32:32.8396306Z compiled: bool, 2025-05-07T20:32:32.8396375Z ) -> None: 2025-05-07T20:32:32.8396457Z torch.manual_seed(2025) 2025-05-07T20:32:32.8396521Z 2025-05-07T20:32:32.8396679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8396827Z 2025-05-07T20:32:32.8396911Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8397027Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8397105Z x = x_sign * x_clamp 2025-05-07T20:32:32.8397175Z x0 = x[:, :D] 2025-05-07T20:32:32.8397248Z x1 = x[:, D:] 2025-05-07T20:32:32.8397310Z 2025-05-07T20:32:32.8397386Z if contiguous: 2025-05-07T20:32:32.8397467Z x0 = x0.contiguous() 2025-05-07T20:32:32.8397545Z x1 = x1.contiguous() 2025-05-07T20:32:32.8397611Z 2025-05-07T20:32:32.8397691Z if scale_ub is not None: 2025-05-07T20:32:32.8397788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8397916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8397981Z ) 2025-05-07T20:32:32.8398049Z else: 2025-05-07T20:32:32.8398131Z scale_ub_tensor = None 2025-05-07T20:32:32.8398192Z 2025-05-07T20:32:32.8398322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8398400Z op = silu_mul_quant 2025-05-07T20:32:32.8398473Z if compiled: 2025-05-07T20:32:32.8398565Z op = torch.compile(op) 2025-05-07T20:32:32.8398663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8398730Z 2025-05-07T20:32:32.8398813Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8398817Z 2025-05-07T20:32:32.8398902Z moe/activation_test.py:117: 2025-05-07T20:32:32.8399025Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8399130Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8399229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8399742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8399827Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8400183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8400404Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8400736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8400931Z kernel = self.compile( 2025-05-07T20:32:32.8401304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8401473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8401598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8401602Z 2025-05-07T20:32:32.8401798Z self = 2025-05-07T20:32:32.8402570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8403067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2afe0c0>} 2025-05-07T20:32:32.8403811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8404001Z context = 2025-05-07T20:32:32.8404006Z 2025-05-07T20:32:32.8404162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8404516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8404617Z module_map=module_map) 2025-05-07T20:32:32.8404847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8404940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8405005Z E ^ 2025-05-07T20:32:32.8405353Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8405361Z 2025-05-07T20:32:32.8405765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8405769Z 2025-05-07T20:32:32.8405864Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8406081Z self=, 2025-05-07T20:32:32.8406150Z T=2048, 2025-05-07T20:32:32.8406216Z D=7168, 2025-05-07T20:32:32.8406293Z scale_ub=1200.0, 2025-05-07T20:32:32.8406367Z contiguous=True, 2025-05-07T20:32:32.8406445Z compiled=False, 2025-05-07T20:32:32.8406506Z ) 2025-05-07T20:32:32.8406723Z self = 2025-05-07T20:32:32.8406894Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8406899Z 2025-05-07T20:32:32.8406965Z @given( 2025-05-07T20:32:32.8407074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8407171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8407279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8407386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8407491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8407554Z ) 2025-05-07T20:32:32.8407793Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8407881Z def test_silu_mul_quant( 2025-05-07T20:32:32.8407946Z self, 2025-05-07T20:32:32.8408018Z T: int, 2025-05-07T20:32:32.8408087Z D: int, 2025-05-07T20:32:32.8408178Z scale_ub: Optional[float], 2025-05-07T20:32:32.8408581Z contiguous: bool, 2025-05-07T20:32:32.8408698Z compiled: bool, 2025-05-07T20:32:32.8408771Z ) -> None: 2025-05-07T20:32:32.8408856Z torch.manual_seed(2025) 2025-05-07T20:32:32.8408920Z 2025-05-07T20:32:32.8409081Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8411327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8411334Z 2025-05-07T20:32:32.8411450Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8411455Z 2025-05-07T20:32:32.8411550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8411766Z self=, 2025-05-07T20:32:32.8411834Z T=1, 2025-05-07T20:32:32.8411912Z D=5120, 2025-05-07T20:32:32.8411983Z scale_ub=1200.0, 2025-05-07T20:32:32.8412061Z contiguous=True, 2025-05-07T20:32:32.8412136Z compiled=False, 2025-05-07T20:32:32.8412198Z ) 2025-05-07T20:32:32.8412409Z self = 2025-05-07T20:32:32.8412565Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8412569Z 2025-05-07T20:32:32.8412638Z @given( 2025-05-07T20:32:32.8412745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8412832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8412941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8413168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8413274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8413340Z ) 2025-05-07T20:32:32.8413576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8413664Z def test_silu_mul_quant( 2025-05-07T20:32:32.8413733Z self, 2025-05-07T20:32:32.8413799Z T: int, 2025-05-07T20:32:32.8413862Z D: int, 2025-05-07T20:32:32.8413957Z scale_ub: Optional[float], 2025-05-07T20:32:32.8414037Z contiguous: bool, 2025-05-07T20:32:32.8414116Z compiled: bool, 2025-05-07T20:32:32.8414184Z ) -> None: 2025-05-07T20:32:32.8414266Z torch.manual_seed(2025) 2025-05-07T20:32:32.8414332Z 2025-05-07T20:32:32.8414491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8414551Z 2025-05-07T20:32:32.8414637Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8414761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8414840Z x = x_sign * x_clamp 2025-05-07T20:32:32.8414911Z x0 = x[:, :D] 2025-05-07T20:32:32.8414980Z x1 = x[:, D:] 2025-05-07T20:32:32.8415043Z 2025-05-07T20:32:32.8415122Z if contiguous: 2025-05-07T20:32:32.8415210Z x0 = x0.contiguous() 2025-05-07T20:32:32.8415294Z x1 = x1.contiguous() 2025-05-07T20:32:32.8415356Z 2025-05-07T20:32:32.8415438Z if scale_ub is not None: 2025-05-07T20:32:32.8415539Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8415664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8415732Z ) 2025-05-07T20:32:32.8415802Z else: 2025-05-07T20:32:32.8415887Z scale_ub_tensor = None 2025-05-07T20:32:32.8415948Z 2025-05-07T20:32:32.8416072Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8416153Z op = silu_mul_quant 2025-05-07T20:32:32.8416237Z if compiled: 2025-05-07T20:32:32.8416328Z op = torch.compile(op) 2025-05-07T20:32:32.8416424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8416493Z 2025-05-07T20:32:32.8416577Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8416666Z 2025-05-07T20:32:32.8416752Z moe/activation_test.py:117: 2025-05-07T20:32:32.8416875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8416967Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8417056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8417545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8417633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8417989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8418212Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8418544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8418633Z kernel = self.compile( 2025-05-07T20:32:32.8419006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8419181Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8419303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8419308Z 2025-05-07T20:32:32.8419506Z self = 2025-05-07T20:32:32.8420283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8420862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2aff6a0>} 2025-05-07T20:32:32.8421605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8421796Z context = 2025-05-07T20:32:32.8421801Z 2025-05-07T20:32:32.8421956Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8422218Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8422316Z module_map=module_map) 2025-05-07T20:32:32.8422470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8422560Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8422631Z E ^ 2025-05-07T20:32:32.8422980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8422985Z 2025-05-07T20:32:32.8423388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8423396Z 2025-05-07T20:32:32.8423490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8423704Z self=, 2025-05-07T20:32:32.8423769Z T=2048, 2025-05-07T20:32:32.8423839Z D=5120, 2025-05-07T20:32:32.8423911Z scale_ub=None, 2025-05-07T20:32:32.8423987Z contiguous=True, 2025-05-07T20:32:32.8424062Z compiled=False, 2025-05-07T20:32:32.8424125Z ) 2025-05-07T20:32:32.8424335Z self = 2025-05-07T20:32:32.8424509Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8424514Z 2025-05-07T20:32:32.8424579Z @given( 2025-05-07T20:32:32.8424691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8424781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8424887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8425084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8425194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8425263Z ) 2025-05-07T20:32:32.8425506Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8425590Z def test_silu_mul_quant( 2025-05-07T20:32:32.8425654Z self, 2025-05-07T20:32:32.8425721Z T: int, 2025-05-07T20:32:32.8425789Z D: int, 2025-05-07T20:32:32.8425881Z scale_ub: Optional[float], 2025-05-07T20:32:32.8425960Z contiguous: bool, 2025-05-07T20:32:32.8426036Z compiled: bool, 2025-05-07T20:32:32.8426110Z ) -> None: 2025-05-07T20:32:32.8426202Z torch.manual_seed(2025) 2025-05-07T20:32:32.8426267Z 2025-05-07T20:32:32.8426431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8426496Z 2025-05-07T20:32:32.8426578Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8428363Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8428369Z 2025-05-07T20:32:32.8428480Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8428587Z 2025-05-07T20:32:32.8428690Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8428908Z self=, 2025-05-07T20:32:32.8428976Z T=16384, 2025-05-07T20:32:32.8429040Z D=5120, 2025-05-07T20:32:32.8429115Z scale_ub=None, 2025-05-07T20:32:32.8429192Z contiguous=True, 2025-05-07T20:32:32.8429269Z compiled=False, 2025-05-07T20:32:32.8429330Z ) 2025-05-07T20:32:32.8429541Z self = 2025-05-07T20:32:32.8429712Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8429717Z 2025-05-07T20:32:32.8429784Z @given( 2025-05-07T20:32:32.8429902Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8429993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8430098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8430212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8430315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8430382Z ) 2025-05-07T20:32:32.8430618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8430702Z def test_silu_mul_quant( 2025-05-07T20:32:32.8430774Z self, 2025-05-07T20:32:32.8430841Z T: int, 2025-05-07T20:32:32.8430908Z D: int, 2025-05-07T20:32:32.8431000Z scale_ub: Optional[float], 2025-05-07T20:32:32.8431082Z contiguous: bool, 2025-05-07T20:32:32.8431156Z compiled: bool, 2025-05-07T20:32:32.8431226Z ) -> None: 2025-05-07T20:32:32.8431329Z torch.manual_seed(2025) 2025-05-07T20:32:32.8431396Z 2025-05-07T20:32:32.8431585Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8433353Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8433444Z 2025-05-07T20:32:32.8433553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8433557Z 2025-05-07T20:32:32.8433649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8433868Z self=, 2025-05-07T20:32:32.8433936Z T=4096, 2025-05-07T20:32:32.8434000Z D=5120, 2025-05-07T20:32:32.8434072Z scale_ub=None, 2025-05-07T20:32:32.8434145Z contiguous=True, 2025-05-07T20:32:32.8434217Z compiled=False, 2025-05-07T20:32:32.8434281Z ) 2025-05-07T20:32:32.8434494Z self = 2025-05-07T20:32:32.8434657Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8434661Z 2025-05-07T20:32:32.8434728Z @given( 2025-05-07T20:32:32.8434834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8434932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8435037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8435142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8435247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8435310Z ) 2025-05-07T20:32:32.8435543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8435629Z def test_silu_mul_quant( 2025-05-07T20:32:32.8435695Z self, 2025-05-07T20:32:32.8435761Z T: int, 2025-05-07T20:32:32.8435825Z D: int, 2025-05-07T20:32:32.8436070Z scale_ub: Optional[float], 2025-05-07T20:32:32.8436163Z contiguous: bool, 2025-05-07T20:32:32.8436237Z compiled: bool, 2025-05-07T20:32:32.8436303Z ) -> None: 2025-05-07T20:32:32.8436388Z torch.manual_seed(2025) 2025-05-07T20:32:32.8436452Z 2025-05-07T20:32:32.8436615Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8438377Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8438383Z 2025-05-07T20:32:32.8438494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8438498Z 2025-05-07T20:32:32.8438593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8438806Z self=, 2025-05-07T20:32:32.8438881Z T=2048, 2025-05-07T20:32:32.8438947Z D=5120, 2025-05-07T20:32:32.8439019Z scale_ub=None, 2025-05-07T20:32:32.8439120Z contiguous=False, 2025-05-07T20:32:32.8439195Z compiled=False, 2025-05-07T20:32:32.8439271Z ) 2025-05-07T20:32:32.8439493Z self = 2025-05-07T20:32:32.8439657Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8439662Z 2025-05-07T20:32:32.8439727Z @given( 2025-05-07T20:32:32.8439838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8439925Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8440032Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8440140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8440245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8440311Z ) 2025-05-07T20:32:32.8440544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8440710Z def test_silu_mul_quant( 2025-05-07T20:32:32.8440779Z self, 2025-05-07T20:32:32.8440847Z T: int, 2025-05-07T20:32:32.8440911Z D: int, 2025-05-07T20:32:32.8441004Z scale_ub: Optional[float], 2025-05-07T20:32:32.8441082Z contiguous: bool, 2025-05-07T20:32:32.8441161Z compiled: bool, 2025-05-07T20:32:32.8441231Z ) -> None: 2025-05-07T20:32:32.8441313Z torch.manual_seed(2025) 2025-05-07T20:32:32.8441374Z 2025-05-07T20:32:32.8441534Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8443294Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8443309Z 2025-05-07T20:32:32.8443417Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8443422Z 2025-05-07T20:32:32.8443513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8443731Z self=, 2025-05-07T20:32:32.8443796Z T=4096, 2025-05-07T20:32:32.8443860Z D=7168, 2025-05-07T20:32:32.8443936Z scale_ub=None, 2025-05-07T20:32:32.8444011Z contiguous=True, 2025-05-07T20:32:32.8444162Z compiled=True, 2025-05-07T20:32:32.8444318Z ) 2025-05-07T20:32:32.8444530Z self = 2025-05-07T20:32:32.8444696Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8444701Z 2025-05-07T20:32:32.8444777Z @given( 2025-05-07T20:32:32.8444884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8444977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8445081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8445187Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8445293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8445356Z ) 2025-05-07T20:32:32.8445589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8445675Z def test_silu_mul_quant( 2025-05-07T20:32:32.8445739Z self, 2025-05-07T20:32:32.8445808Z T: int, 2025-05-07T20:32:32.8445878Z D: int, 2025-05-07T20:32:32.8445965Z scale_ub: Optional[float], 2025-05-07T20:32:32.8446048Z contiguous: bool, 2025-05-07T20:32:32.8446122Z compiled: bool, 2025-05-07T20:32:32.8446189Z ) -> None: 2025-05-07T20:32:32.8446274Z torch.manual_seed(2025) 2025-05-07T20:32:32.8446339Z 2025-05-07T20:32:32.8446497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8448260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8448266Z 2025-05-07T20:32:32.8448373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8448377Z 2025-05-07T20:32:32.8448472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8448682Z self=, 2025-05-07T20:32:32.8448837Z T=2048, 2025-05-07T20:32:32.8448902Z D=5120, 2025-05-07T20:32:32.8448976Z scale_ub=1200.0, 2025-05-07T20:32:32.8449053Z contiguous=False, 2025-05-07T20:32:32.8449126Z compiled=False, 2025-05-07T20:32:32.8449187Z ) 2025-05-07T20:32:32.8449403Z self = 2025-05-07T20:32:32.8449576Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8449580Z 2025-05-07T20:32:32.8449645Z @given( 2025-05-07T20:32:32.8449755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8449847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8449951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8450058Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8450161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8450226Z ) 2025-05-07T20:32:32.8450465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8450549Z def test_silu_mul_quant( 2025-05-07T20:32:32.8450620Z self, 2025-05-07T20:32:32.8450683Z T: int, 2025-05-07T20:32:32.8450749Z D: int, 2025-05-07T20:32:32.8450838Z scale_ub: Optional[float], 2025-05-07T20:32:32.8450916Z contiguous: bool, 2025-05-07T20:32:32.8450989Z compiled: bool, 2025-05-07T20:32:32.8451061Z ) -> None: 2025-05-07T20:32:32.8451146Z torch.manual_seed(2025) 2025-05-07T20:32:32.8451207Z 2025-05-07T20:32:32.8451368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8453196Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8453213Z 2025-05-07T20:32:32.8453323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8453327Z 2025-05-07T20:32:32.8453418Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8453636Z self=, 2025-05-07T20:32:32.8453702Z T=4096, 2025-05-07T20:32:32.8453766Z D=7168, 2025-05-07T20:32:32.8453842Z scale_ub=1200.0, 2025-05-07T20:32:32.8453923Z contiguous=True, 2025-05-07T20:32:32.8453995Z compiled=False, 2025-05-07T20:32:32.8454062Z ) 2025-05-07T20:32:32.8454268Z self = 2025-05-07T20:32:32.8454433Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8454445Z 2025-05-07T20:32:32.8454510Z @given( 2025-05-07T20:32:32.8454619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8454707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8454810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8454918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8455024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8455087Z ) 2025-05-07T20:32:32.8455320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8455407Z def test_silu_mul_quant( 2025-05-07T20:32:32.8455478Z self, 2025-05-07T20:32:32.8455552Z T: int, 2025-05-07T20:32:32.8455616Z D: int, 2025-05-07T20:32:32.8455701Z scale_ub: Optional[float], 2025-05-07T20:32:32.8455785Z contiguous: bool, 2025-05-07T20:32:32.8455861Z compiled: bool, 2025-05-07T20:32:32.8456012Z ) -> None: 2025-05-07T20:32:32.8456103Z torch.manual_seed(2025) 2025-05-07T20:32:32.8456163Z 2025-05-07T20:32:32.8456321Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8458090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8458096Z 2025-05-07T20:32:32.8458202Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8458207Z 2025-05-07T20:32:32.8458300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8458518Z self=, 2025-05-07T20:32:32.8458589Z T=16384, 2025-05-07T20:32:32.8458653Z D=7168, 2025-05-07T20:32:32.8458725Z scale_ub=None, 2025-05-07T20:32:32.8458803Z contiguous=False, 2025-05-07T20:32:32.8458876Z compiled=True, 2025-05-07T20:32:32.8458940Z ) 2025-05-07T20:32:32.8459150Z self = 2025-05-07T20:32:32.8459316Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:32.8459321Z 2025-05-07T20:32:32.8459386Z @given( 2025-05-07T20:32:32.8459599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8459687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8459794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8459901Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8460002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8460076Z ) 2025-05-07T20:32:32.8460316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8460397Z def test_silu_mul_quant( 2025-05-07T20:32:32.8460470Z self, 2025-05-07T20:32:32.8460534Z T: int, 2025-05-07T20:32:32.8460599Z D: int, 2025-05-07T20:32:32.8460692Z scale_ub: Optional[float], 2025-05-07T20:32:32.8460772Z contiguous: bool, 2025-05-07T20:32:32.8460848Z compiled: bool, 2025-05-07T20:32:32.8460923Z ) -> None: 2025-05-07T20:32:32.8465195Z torch.manual_seed(2025) 2025-05-07T20:32:32.8465279Z 2025-05-07T20:32:32.8465463Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8467254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8467266Z 2025-05-07T20:32:32.8467385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8467392Z 2025-05-07T20:32:32.8467492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8467714Z self=, 2025-05-07T20:32:32.8467792Z T=4096, 2025-05-07T20:32:32.8467863Z D=7168, 2025-05-07T20:32:32.8467940Z scale_ub=None, 2025-05-07T20:32:32.8468023Z contiguous=True, 2025-05-07T20:32:32.8468103Z compiled=False, 2025-05-07T20:32:32.8468173Z ) 2025-05-07T20:32:32.8468388Z self = 2025-05-07T20:32:32.8468682Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8468687Z 2025-05-07T20:32:32.8468771Z @given( 2025-05-07T20:32:32.8468887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8468977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8469088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8469198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8469304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8469379Z ) 2025-05-07T20:32:32.8469627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8469714Z def test_silu_mul_quant( 2025-05-07T20:32:32.8469788Z self, 2025-05-07T20:32:32.8469862Z T: int, 2025-05-07T20:32:32.8469931Z D: int, 2025-05-07T20:32:32.8470027Z scale_ub: Optional[float], 2025-05-07T20:32:32.8470115Z contiguous: bool, 2025-05-07T20:32:32.8470197Z compiled: bool, 2025-05-07T20:32:32.8470271Z ) -> None: 2025-05-07T20:32:32.8470359Z torch.manual_seed(2025) 2025-05-07T20:32:32.8470430Z 2025-05-07T20:32:32.8470593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8472438Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8472449Z 2025-05-07T20:32:32.8472563Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8472572Z 2025-05-07T20:32:32.8472669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8472888Z self=, 2025-05-07T20:32:32.8472962Z T=16384, 2025-05-07T20:32:32.8473037Z D=7168, 2025-05-07T20:32:32.8473116Z scale_ub=None, 2025-05-07T20:32:32.8473197Z contiguous=True, 2025-05-07T20:32:32.8473289Z compiled=False, 2025-05-07T20:32:32.8473358Z ) 2025-05-07T20:32:32.8473567Z self = 2025-05-07T20:32:32.8473739Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:32.8473743Z 2025-05-07T20:32:32.8473820Z @given( 2025-05-07T20:32:32.8473930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8474033Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8474139Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8474248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8474362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8474432Z ) 2025-05-07T20:32:32.8474674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8474761Z def test_silu_mul_quant( 2025-05-07T20:32:32.8474832Z self, 2025-05-07T20:32:32.8474906Z T: int, 2025-05-07T20:32:32.8474975Z D: int, 2025-05-07T20:32:32.8475067Z scale_ub: Optional[float], 2025-05-07T20:32:32.8475154Z contiguous: bool, 2025-05-07T20:32:32.8475234Z compiled: bool, 2025-05-07T20:32:32.8475305Z ) -> None: 2025-05-07T20:32:32.8475395Z torch.manual_seed(2025) 2025-05-07T20:32:32.8475470Z 2025-05-07T20:32:32.8475628Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8477407Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
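[Annotation] The allocator hint repeated in every one of these messages can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the process touches the GPU. A minimal sketch follows; the variable and value come straight from the error text, though whether it helps this run is an open question, since 21.73 GiB is genuinely allocated rather than fragmented:

    import os

    # The caching allocator reads PYTORCH_CUDA_ALLOC_CONF when it initializes,
    # so this must happen before the first tensor is placed on the GPU.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402

    x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In CI the equivalent would be exporting the variable in the job environment before pytest starts.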
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8477493Z 2025-05-07T20:32:32.8477605Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8477618Z 2025-05-07T20:32:32.8477722Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8477942Z self=, 2025-05-07T20:32:32.8478022Z T=16384, 2025-05-07T20:32:32.8478092Z D=7168, 2025-05-07T20:32:32.8478167Z scale_ub=1200.0, 2025-05-07T20:32:32.8478247Z contiguous=True, 2025-05-07T20:32:32.8478324Z compiled=False, 2025-05-07T20:32:32.8478394Z ) 2025-05-07T20:32:32.8478605Z self = 2025-05-07T20:32:32.8478776Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8478780Z 2025-05-07T20:32:32.8478856Z @given( 2025-05-07T20:32:32.8478965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8479061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8479179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8479290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8479393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8479468Z ) 2025-05-07T20:32:32.8479787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8479875Z def test_silu_mul_quant( 2025-05-07T20:32:32.8479949Z self, 2025-05-07T20:32:32.8480019Z T: int, 2025-05-07T20:32:32.8480090Z D: int, 2025-05-07T20:32:32.8480187Z scale_ub: Optional[float], 2025-05-07T20:32:32.8480268Z contiguous: bool, 2025-05-07T20:32:32.8480349Z compiled: bool, 2025-05-07T20:32:32.8480421Z ) -> None: 2025-05-07T20:32:32.8480507Z torch.manual_seed(2025) 2025-05-07T20:32:32.8480578Z 2025-05-07T20:32:32.8480742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8482518Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
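[Annotation] Only 30.44 MiB is free while 21.73 GiB sits allocated by PyTorch, even though each example's tensors go out of scope when the test body returns; something, for instance torch.compile state from the compiled=True examples, appears to be keeping earlier allocations alive across Hypothesis examples. One purely illustrative option, not something activation_test.py currently does, is to flush state at the top of the test body so every example starts from a clean allocator:

    import gc

    import torch

    def reset_cuda_memory() -> None:
        # Drop unreferenced tensors, then return cached allocator blocks to the
        # driver. Tensors that are still referenced keep their memory either way.
        gc.collect()
        torch.cuda.empty_cache()

Calling reset_cuda_memory() as the first statement of test_silu_mul_quant would bound the per-example footprint to tensors that are actually still live.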
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8482533Z 2025-05-07T20:32:32.8482647Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8482651Z 2025-05-07T20:32:32.8482746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8482964Z self=, 2025-05-07T20:32:32.8483041Z T=128, 2025-05-07T20:32:32.8483115Z D=5120, 2025-05-07T20:32:32.8483197Z scale_ub=1200.0, 2025-05-07T20:32:32.8483278Z contiguous=False, 2025-05-07T20:32:32.8483359Z compiled=False, 2025-05-07T20:32:32.8483430Z ) 2025-05-07T20:32:32.8483641Z self = 2025-05-07T20:32:32.8483820Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:32.8483824Z 2025-05-07T20:32:32.8483901Z @given( 2025-05-07T20:32:32.8484012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8484104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8484408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8484517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8484632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8484702Z ) 2025-05-07T20:32:32.8484946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8485031Z def test_silu_mul_quant( 2025-05-07T20:32:32.8485104Z self, 2025-05-07T20:32:32.8485177Z T: int, 2025-05-07T20:32:32.8485245Z D: int, 2025-05-07T20:32:32.8485334Z scale_ub: Optional[float], 2025-05-07T20:32:32.8485424Z contiguous: bool, 2025-05-07T20:32:32.8485505Z compiled: bool, 2025-05-07T20:32:32.8485581Z ) -> None: 2025-05-07T20:32:32.8485682Z torch.manual_seed(2025) 2025-05-07T20:32:32.8485754Z 2025-05-07T20:32:32.8485913Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8485989Z 2025-05-07T20:32:32.8486080Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8486210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8486296Z x = x_sign * x_clamp 2025-05-07T20:32:32.8486377Z x0 = x[:, :D] 2025-05-07T20:32:32.8486456Z x1 = x[:, D:] 2025-05-07T20:32:32.8486520Z 2025-05-07T20:32:32.8486599Z if contiguous: 2025-05-07T20:32:32.8486685Z x0 = x0.contiguous() 2025-05-07T20:32:32.8486769Z x1 = x1.contiguous() 2025-05-07T20:32:32.8486842Z 2025-05-07T20:32:32.8486926Z if scale_ub is not None: 2025-05-07T20:32:32.8487026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8487242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8487314Z ) 2025-05-07T20:32:32.8487387Z else: 2025-05-07T20:32:32.8487478Z scale_ub_tensor = None 2025-05-07T20:32:32.8487546Z 2025-05-07T20:32:32.8487670Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8487765Z op = silu_mul_quant 2025-05-07T20:32:32.8487845Z if compiled: 2025-05-07T20:32:32.8487942Z op = torch.compile(op) 2025-05-07T20:32:32.8488042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8488110Z 2025-05-07T20:32:32.8488198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8488202Z 2025-05-07T20:32:32.8488293Z moe/activation_test.py:117: 2025-05-07T20:32:32.8488419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8488520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8488616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8489119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8489213Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8489567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8489792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8490126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8490219Z kernel = self.compile( 2025-05-07T20:32:32.8490604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8490778Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8490901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8490906Z 2025-05-07T20:32:32.8491117Z self = 2025-05-07T20:32:32.8491892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8492486Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad2399bc0>} 2025-05-07T20:32:32.8493229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8493420Z context = 2025-05-07T20:32:32.8493424Z 2025-05-07T20:32:32.8493584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8493853Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8493960Z module_map=module_map) 2025-05-07T20:32:32.8494117Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8494217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8494289Z E ^ 2025-05-07T20:32:32.8494638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8494643Z 2025-05-07T20:32:32.8495053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8495058Z 2025-05-07T20:32:32.8495158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8495382Z self=, 2025-05-07T20:32:32.8495457Z T=2048, 2025-05-07T20:32:32.8495530Z D=7168, 2025-05-07T20:32:32.8495711Z scale_ub=None, 2025-05-07T20:32:32.8495794Z contiguous=False, 2025-05-07T20:32:32.8495872Z compiled=False, 2025-05-07T20:32:32.8495946Z ) 2025-05-07T20:32:32.8496158Z self = 2025-05-07T20:32:32.8496334Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:32.8496339Z 2025-05-07T20:32:32.8496412Z @given( 2025-05-07T20:32:32.8496523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8496619Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8496727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8496835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8496947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8497017Z ) 2025-05-07T20:32:32.8497256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8497357Z def test_silu_mul_quant( 2025-05-07T20:32:32.8497427Z self, 2025-05-07T20:32:32.8497497Z T: int, 2025-05-07T20:32:32.8497574Z D: int, 2025-05-07T20:32:32.8497667Z scale_ub: Optional[float], 2025-05-07T20:32:32.8497748Z contiguous: bool, 2025-05-07T20:32:32.8497836Z compiled: bool, 2025-05-07T20:32:32.8497907Z ) -> None: 2025-05-07T20:32:32.8497998Z torch.manual_seed(2025) 2025-05-07T20:32:32.8498064Z 2025-05-07T20:32:32.8498224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8500004Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
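[Annotation] Interleaved with the OOM failures is a second, unrelated failure mode: Triton refuses to compile _fbgemm_silu_mul_quant because fp8e4nv (float8 E4M3) is not available on this GPU. The g5.4xlarge runner carries an A10G, compute capability sm_86, while Triton's fp8e4nv lowering needs Ada/Hopper-class hardware (sm_89 or newer); only fp8e4b15 and fp8e5 exist here, exactly as the error says. A hedged guard one could put in front of such tests, with the helper and class names purely illustrative:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to float8 E4M3, which its NVIDIA backend only
        # lowers on compute capability 8.9 or newer (Ada/Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires an sm_89+ GPU")
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical class, for illustration
        ...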
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8500010Z 2025-05-07T20:32:32.8500123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8500128Z 2025-05-07T20:32:32.8500317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8500532Z self=, 2025-05-07T20:32:32.8500603Z T=128, 2025-05-07T20:32:32.8500679Z D=7168, 2025-05-07T20:32:32.8500753Z scale_ub=1200.0, 2025-05-07T20:32:32.8500833Z contiguous=True, 2025-05-07T20:32:32.8500911Z compiled=True, 2025-05-07T20:32:32.8500978Z ) 2025-05-07T20:32:32.8501197Z self = 2025-05-07T20:32:32.8501357Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8501361Z 2025-05-07T20:32:32.8501434Z @given( 2025-05-07T20:32:32.8501558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8501651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8501761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8501877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8501989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8502063Z ) 2025-05-07T20:32:32.8502304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8502392Z def test_silu_mul_quant( 2025-05-07T20:32:32.8502466Z self, 2025-05-07T20:32:32.8502536Z T: int, 2025-05-07T20:32:32.8502606Z D: int, 2025-05-07T20:32:32.8502698Z scale_ub: Optional[float], 2025-05-07T20:32:32.8502779Z contiguous: bool, 2025-05-07T20:32:32.8502859Z compiled: bool, 2025-05-07T20:32:32.8502933Z ) -> None: 2025-05-07T20:32:32.8503021Z torch.manual_seed(2025) 2025-05-07T20:32:32.8503088Z 2025-05-07T20:32:32.8503332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8503400Z 2025-05-07T20:32:32.8503486Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8503605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8503688Z x = x_sign * x_clamp 2025-05-07T20:32:32.8503759Z x0 = x[:, :D] 2025-05-07T20:32:32.8503829Z x1 = x[:, D:] 2025-05-07T20:32:32.8503891Z 2025-05-07T20:32:32.8505449Z if contiguous: 2025-05-07T20:32:32.8505530Z x0 = x0.contiguous() 2025-05-07T20:32:32.8505609Z x1 = x1.contiguous() 2025-05-07T20:32:32.8505672Z 2025-05-07T20:32:32.8505751Z if scale_ub is not None: 2025-05-07T20:32:32.8505849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:32.8505979Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:32.8506045Z ) 2025-05-07T20:32:32.8506109Z else: 2025-05-07T20:32:32.8506200Z scale_ub_tensor = None 2025-05-07T20:32:32.8506260Z 2025-05-07T20:32:32.8506381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:32.8506464Z op = silu_mul_quant 2025-05-07T20:32:32.8506541Z if compiled: 2025-05-07T20:32:32.8506639Z op = torch.compile(op) 2025-05-07T20:32:32.8506734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8506795Z 2025-05-07T20:32:32.8506878Z > y_fp8, y_scale = fn() 2025-05-07T20:32:32.8506882Z 2025-05-07T20:32:32.8506970Z moe/activation_test.py:117: 2025-05-07T20:32:32.8507090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8507184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:32.8507274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:32.8507634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:32.8507719Z return fn(*args, **kwargs) 
2025-05-07T20:32:32.8508208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:32.8508615Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:32.8509006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:32.8509382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:32.8509717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:32.8509806Z kernel = self.compile( 2025-05-07T20:32:32.8510184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:32.8510354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.8510481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:32.8510485Z 2025-05-07T20:32:32.8510690Z self = 2025-05-07T20:32:32.8511470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:32.8511984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7faad1fd02c0>} 2025-05-07T20:32:32.8512720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:32.8512908Z context = 2025-05-07T20:32:32.8512917Z 2025-05-07T20:32:32.8513193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:32.8513456Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.8513562Z module_map=module_map) 2025-05-07T20:32:32.8513728Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.8513820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.8513895Z E ^ 2025-05-07T20:32:32.8514244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.8514248Z 2025-05-07T20:32:32.8514658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.8514663Z 2025-05-07T20:32:32.8514759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8514975Z self=, 2025-05-07T20:32:32.8515055Z T=128, 2025-05-07T20:32:32.8515128Z D=7168, 2025-05-07T20:32:32.8515208Z scale_ub=1200.0, 2025-05-07T20:32:32.8515291Z contiguous=True, 2025-05-07T20:32:32.8515376Z compiled=False, 2025-05-07T20:32:32.8515445Z ) 2025-05-07T20:32:32.8515664Z self = 2025-05-07T20:32:32.8515839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:32.8515844Z 2025-05-07T20:32:32.8515923Z @given( 2025-05-07T20:32:32.8516037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8516132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8516247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8516362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8516469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8516544Z ) 2025-05-07T20:32:32.8516789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8516875Z def test_silu_mul_quant( 2025-05-07T20:32:32.8516957Z self, 2025-05-07T20:32:32.8517026Z T: int, 2025-05-07T20:32:32.8517102Z D: int, 2025-05-07T20:32:32.8517199Z scale_ub: Optional[float], 2025-05-07T20:32:32.8517370Z contiguous: bool, 2025-05-07T20:32:32.8517453Z compiled: bool, 2025-05-07T20:32:32.8517524Z ) -> None: 2025-05-07T20:32:32.8517612Z torch.manual_seed(2025) 2025-05-07T20:32:32.8517684Z 2025-05-07T20:32:32.8517852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8517918Z 2025-05-07T20:32:32.8518005Z x_sign = torch.sign(x) 2025-05-07T20:32:32.8518127Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:32.8519901Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8519911Z 2025-05-07T20:32:32.8520027Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:32.8520031Z 2025-05-07T20:32:32.8520133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8520346Z self=, 2025-05-07T20:32:32.8520418Z T=128, 2025-05-07T20:32:32.8520490Z D=5120, 2025-05-07T20:32:32.8520567Z scale_ub=1200.0, 2025-05-07T20:32:32.8520646Z contiguous=True, 2025-05-07T20:32:32.8520726Z compiled=True, 2025-05-07T20:32:32.8520792Z ) 2025-05-07T20:32:32.8521081Z self = 2025-05-07T20:32:32.8521246Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:32.8521250Z 2025-05-07T20:32:32.8521324Z @given( 2025-05-07T20:32:32.8521435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8521534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8521640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8521752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8521859Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8521927Z ) 2025-05-07T20:32:32.8522167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8522258Z def test_silu_mul_quant( 2025-05-07T20:32:32.8522329Z self, 2025-05-07T20:32:32.8522401Z T: int, 2025-05-07T20:32:32.8522471Z D: int, 2025-05-07T20:32:32.8522561Z scale_ub: Optional[float], 2025-05-07T20:32:32.8522654Z contiguous: bool, 2025-05-07T20:32:32.8522733Z compiled: bool, 2025-05-07T20:32:32.8522806Z ) -> None: 2025-05-07T20:32:32.8522895Z torch.manual_seed(2025) 2025-05-07T20:32:32.8522964Z 2025-05-07T20:32:32.8523128Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8523202Z 2025-05-07T20:32:32.8523289Z > x_sign = torch.sign(x) 2025-05-07T20:32:32.8525133Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
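[Annotation] By this point the failures have moved past the initial torch.randn: even the 20.00 MiB temporary for torch.sign(x) fails with only 8.44 MiB free. When triaging runs like this, torch.cuda.mem_get_info() reports the same free/total figures the allocator prints in these messages:

    import torch

    # Driver-reported free and total memory for the current CUDA device; these
    # match the "X MiB is free" numbers in the OOM messages above.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"{free_bytes / 2**20:.2f} MiB free of {total_bytes / 2**30:.2f} GiB")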
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8525140Z 2025-05-07T20:32:32.8525250Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:32.8525255Z 2025-05-07T20:32:32.8525352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:32.8525568Z self=, 2025-05-07T20:32:32.8525757Z T=128, 2025-05-07T20:32:32.8525829Z D=7168, 2025-05-07T20:32:32.8525904Z scale_ub=None, 2025-05-07T20:32:32.8525981Z contiguous=True, 2025-05-07T20:32:32.8526060Z compiled=True, 2025-05-07T20:32:32.8526127Z ) 2025-05-07T20:32:32.8526338Z self = 2025-05-07T20:32:32.8526496Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.8526501Z 2025-05-07T20:32:32.8526572Z @given( 2025-05-07T20:32:32.8526688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:32.8526780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:32.8526891Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:32.8527002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:32.8527108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:32.8527189Z ) 2025-05-07T20:32:32.8527430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:32.8527517Z def test_silu_mul_quant( 2025-05-07T20:32:32.8527592Z self, 2025-05-07T20:32:32.8527663Z T: int, 2025-05-07T20:32:32.8527734Z D: int, 2025-05-07T20:32:32.8527826Z scale_ub: Optional[float], 2025-05-07T20:32:32.8527909Z contiguous: bool, 2025-05-07T20:32:32.8527991Z compiled: bool, 2025-05-07T20:32:32.8528065Z ) -> None: 2025-05-07T20:32:32.8528152Z torch.manual_seed(2025) 2025-05-07T20:32:32.8528217Z 2025-05-07T20:32:32.8528381Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:32.8530221Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:32.8530235Z 2025-05-07T20:32:32.8530354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:32.8530484Z =============================== warnings summary =============================== 2025-05-07T20:32:32.8530793Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8531100Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8531396Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:32.8532266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:32.8532496Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:32.8532500Z 2025-05-07T20:32:32.8532675Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:32.8533931Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:32.8534121Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:32.8534126Z 2025-05-07T20:32:32.8534331Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:32.8534570Z ================== 1 failed, 1 passed, 13 warnings in 19.85s =================== 2025-05-07T20:32:34.9435491Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:35.0126063Z 2025-05-07T20:32:35.0126473Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:35.0126820Z 2025-05-07T20:32:35.0126824Z 2025-05-07T20:32:35.0148468Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:37.2317081Z ============================= test session starts ============================== 2025-05-07T20:32:37.2318308Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:37.2319361Z cachedir: .pytest_cache 2025-05-07T20:32:37.2320459Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:37.2321179Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:37.2321583Z plugins: hypothesis-6.131.14 2025-05-07T20:32:38.8420563Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:38.9408152Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:38.9408731Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:38.9408964Z 2025-05-07T20:32:40.9046709Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.9047821Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:40.9049207Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.9050671Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.9051666Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.9053053Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.9054457Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9055757Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.9057131Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9058175Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:40.9059436Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.9060848Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:40.9061686Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9062883Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.9064088Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:40.9065125Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:40.9066131Z W0507 20:32:40.902000 88618 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:40.9067335Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.9068595Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.9069578Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9070642Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:40.9071680Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:40.9072439Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:40.9073595Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.9074943Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.9075990Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9076896Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9077636Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:40.9078663Z W0507 20:32:40.902000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.9218905Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.9220090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:40.9221428Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.9223120Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.9224118Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.9225457Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.9226862Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.9228202Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.9229607Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.9230675Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:40.9232090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.9233360Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:40.9234209Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9235403Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.9236608Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:40.9237643Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:40.9238649Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 
2025-05-07T20:32:40.9239863Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.9241140Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.9242059Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:40.9243165Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:40.9244194Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:40.9245090Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:40.9246626Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.9247980Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.9249021Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.9249931Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.9250667Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:40.9251683Z W0507 20:32:40.920000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3243072Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3244411Z self=, 2025-05-07T20:32:41.3244992Z T=1, 2025-05-07T20:32:41.3245299Z D=5120, 2025-05-07T20:32:41.3245582Z scale_ub=None, 2025-05-07T20:32:41.3245782Z contiguous=True, 2025-05-07T20:32:41.3246000Z compiled=True, 2025-05-07T20:32:41.3246200Z ) 2025-05-07T20:32:41.3246933Z self = 2025-05-07T20:32:41.3247429Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.3247681Z 2025-05-07T20:32:41.3247767Z @given( 2025-05-07T20:32:41.3248012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3248314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3248614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3248935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3249249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3249525Z ) 2025-05-07T20:32:41.3249867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3250295Z def test_silu_mul_quant( 2025-05-07T20:32:41.3250529Z self, 2025-05-07T20:32:41.3250715Z T: int, 2025-05-07T20:32:41.3250896Z D: int, 2025-05-07T20:32:41.3251114Z scale_ub: Optional[float], 2025-05-07T20:32:41.3251376Z contiguous: bool, 2025-05-07T20:32:41.3251601Z compiled: bool, 2025-05-07T20:32:41.3251823Z ) -> None: 2025-05-07T20:32:41.3252034Z torch.manual_seed(2025) 2025-05-07T20:32:41.3252260Z 2025-05-07T20:32:41.3252535Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3252866Z 2025-05-07T20:32:41.3253055Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3253339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3253645Z x = x_sign * x_clamp 2025-05-07T20:32:41.3253885Z x0 = x[:, :D] 2025-05-07T20:32:41.3254088Z x1 = x[:, D:] 2025-05-07T20:32:41.3254291Z 2025-05-07T20:32:41.3254469Z if contiguous: 2025-05-07T20:32:41.3254689Z x0 = x0.contiguous() 2025-05-07T20:32:41.3254948Z x1 = x1.contiguous() 2025-05-07T20:32:41.3255183Z 2025-05-07T20:32:41.3255363Z if scale_ub is not None: 2025-05-07T20:32:41.3255633Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3255965Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3256259Z ) 2025-05-07T20:32:41.3256445Z else: 2025-05-07T20:32:41.3256651Z scale_ub_tensor = None 2025-05-07T20:32:41.3257072Z 2025-05-07T20:32:41.3257299Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3257609Z op = silu_mul_quant 2025-05-07T20:32:41.3257855Z if compiled: 2025-05-07T20:32:41.3258095Z op = torch.compile(op) 2025-05-07T20:32:41.3258384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3258649Z 2025-05-07T20:32:41.3258825Z y_fp8, y_scale = fn() 2025-05-07T20:32:41.3259103Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:41.3259383Z 2025-05-07T20:32:41.3259608Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3259938Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:41.3260224Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:41.3260524Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:41.3260874Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3261183Z 2025-05-07T20:32:41.3261380Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:41.3261569Z 2025-05-07T20:32:41.3261667Z moe/activation_test.py:126: 2025-05-07T20:32:41.3261978Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3262307Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:41.3272011Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:41.3272865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:41.3273622Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:41.3274309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3274993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3275684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:41.3276418Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:41.3277158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:41.3277794Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:41.3278402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:41.3278926Z fn() 2025-05-07T20:32:41.3279442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:41.3280034Z self.fn.run( 2025-05-07T20:32:41.3280513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3281051Z kernel = self.compile( 2025-05-07T20:32:41.3281598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3282309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3282715Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3282946Z 2025-05-07T20:32:41.3283155Z self = 2025-05-07T20:32:41.3284327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3285720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085d836a0>} 2025-05-07T20:32:41.3287063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3288176Z context = 2025-05-07T20:32:41.3288468Z 2025-05-07T20:32:41.3288637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3289159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3289635Z module_map=module_map) 2025-05-07T20:32:41.3290016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3290374Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:41.3290649Z E ^ 2025-05-07T20:32:41.3291128Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3291577Z 2025-05-07T20:32:41.3291994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3292524Z 2025-05-07T20:32:41.3292633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3293053Z self=, 2025-05-07T20:32:41.3293454Z T=2048, 2025-05-07T20:32:41.3293642Z D=5120, 2025-05-07T20:32:41.3293847Z scale_ub=1200.0, 2025-05-07T20:32:41.3294084Z contiguous=True, 2025-05-07T20:32:41.3294302Z compiled=False, 2025-05-07T20:32:41.3294523Z ) 2025-05-07T20:32:41.3294852Z self = 2025-05-07T20:32:41.3295433Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:41.3295717Z 2025-05-07T20:32:41.3295794Z @given( 2025-05-07T20:32:41.3296031Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3296340Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3296662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3296998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3297339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3297626Z ) 2025-05-07T20:32:41.3297982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3298431Z def test_silu_mul_quant( 2025-05-07T20:32:41.3298675Z self, 2025-05-07T20:32:41.3298881Z T: int, 2025-05-07T20:32:41.3299092Z D: int, 2025-05-07T20:32:41.3299309Z scale_ub: Optional[float], 2025-05-07T20:32:41.3299588Z contiguous: bool, 2025-05-07T20:32:41.3299842Z compiled: bool, 2025-05-07T20:32:41.3300079Z ) -> None: 2025-05-07T20:32:41.3300307Z torch.manual_seed(2025) 2025-05-07T20:32:41.3300570Z 2025-05-07T20:32:41.3300849Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3301221Z 2025-05-07T20:32:41.3301437Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3301744Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3302063Z x = x_sign * x_clamp 2025-05-07T20:32:41.3302306Z x0 = x[:, :D] 2025-05-07T20:32:41.3302536Z x1 = x[:, D:] 2025-05-07T20:32:41.3302742Z 2025-05-07T20:32:41.3302938Z if contiguous: 2025-05-07T20:32:41.3303182Z x0 = x0.contiguous() 2025-05-07T20:32:41.3303458Z x1 = x1.contiguous() 2025-05-07T20:32:41.3303697Z 2025-05-07T20:32:41.3303897Z if scale_ub is not None: 2025-05-07T20:32:41.3304176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3304514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3304831Z ) 2025-05-07T20:32:41.3305036Z else: 2025-05-07T20:32:41.3305252Z scale_ub_tensor = None 2025-05-07T20:32:41.3305515Z 2025-05-07T20:32:41.3305762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3306191Z op = silu_mul_quant 2025-05-07T20:32:41.3306463Z if compiled: 2025-05-07T20:32:41.3306732Z op = torch.compile(op) 2025-05-07T20:32:41.3307038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3307336Z 2025-05-07T20:32:41.3307552Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3307727Z 2025-05-07T20:32:41.3307844Z moe/activation_test.py:117: 2025-05-07T20:32:41.3308146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3308880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3309178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3309875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3310573Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3311124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3311830Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3312546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3313091Z kernel = self.compile( 2025-05-07T20:32:41.3313649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3314303Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3314704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3314941Z 2025-05-07T20:32:41.3315308Z self = 2025-05-07T20:32:41.3316408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3317791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa0859f5f80>} 2025-05-07T20:32:41.3319150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3320180Z context = 2025-05-07T20:32:41.3320471Z 2025-05-07T20:32:41.3320655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3321193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3321663Z module_map=module_map) 2025-05-07T20:32:41.3322049Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3322423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3322687Z E ^ 2025-05-07T20:32:41.3323167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3323617Z 2025-05-07T20:32:41.3324051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.7239935Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.7241457Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:41.7242819Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:41.7244750Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:41.7245729Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:41.7247051Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:41.7248454Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.7249864Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:41.7251348Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.7252478Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] module_map=module_map) 2025-05-07T20:32:41.7254013Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:41.7255411Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:41.7256263Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:41.7257460Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:41.7258668Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:41.7259715Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:41.7260737Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:41.7261959Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:41.7263238Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:41.7264145Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:41.7265238Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:41.7266280Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:41.7267045Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:41.7268310Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:41.7269663Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:41.7270730Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.7271647Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.7272388Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:41.7273415Z W0507 20:32:41.719000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:41.8030619Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:41.8031694Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:41.8063206Z W0507 20:32:41.799000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4399477Z 2025-05-07T20:32:42.4400116Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4400934Z self=, 2025-05-07T20:32:42.4401485Z T=2048, 2025-05-07T20:32:42.4401679Z D=5120, 2025-05-07T20:32:42.4401881Z scale_ub=1200.0, 2025-05-07T20:32:42.4402109Z contiguous=True, 2025-05-07T20:32:42.4402324Z compiled=True, 2025-05-07T20:32:42.4402531Z ) 2025-05-07T20:32:42.4402855Z self = 2025-05-07T20:32:42.4403350Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.4403630Z 2025-05-07T20:32:42.4403707Z @given( 2025-05-07T20:32:42.4403965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4404397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4404713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4405046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4405759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4406036Z ) 2025-05-07T20:32:42.4406390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4406836Z def test_silu_mul_quant( 2025-05-07T20:32:42.4407072Z self, 2025-05-07T20:32:42.4407272Z T: int, 2025-05-07T20:32:42.4407476Z D: int, 2025-05-07T20:32:42.4407693Z scale_ub: Optional[float], 2025-05-07T20:32:42.4407970Z contiguous: bool, 2025-05-07T20:32:42.4408213Z compiled: bool, 2025-05-07T20:32:42.4408677Z ) -> None: 2025-05-07T20:32:42.4408900Z torch.manual_seed(2025) 2025-05-07T20:32:42.4409156Z 2025-05-07T20:32:42.4409426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4409774Z 2025-05-07T20:32:42.4409977Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4410269Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4410581Z x = x_sign * x_clamp 2025-05-07T20:32:42.4410823Z x0 = x[:, :D] 2025-05-07T20:32:42.4411041Z x1 = x[:, D:] 2025-05-07T20:32:42.4411246Z 2025-05-07T20:32:42.4411436Z if contiguous: 2025-05-07T20:32:42.4411674Z x0 = x0.contiguous() 2025-05-07T20:32:42.4411928Z x1 = x1.contiguous() 2025-05-07T20:32:42.4412174Z 2025-05-07T20:32:42.4412374Z if scale_ub is not None: 2025-05-07T20:32:42.4412643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4412982Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4413293Z ) 2025-05-07T20:32:42.4413483Z else: 2025-05-07T20:32:42.4413918Z scale_ub_tensor = None 2025-05-07T20:32:42.4414179Z 2025-05-07T20:32:42.4414405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4414721Z op = silu_mul_quant 2025-05-07T20:32:42.4414971Z if compiled: 2025-05-07T20:32:42.4415231Z op = torch.compile(op) 2025-05-07T20:32:42.4415520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4415795Z 2025-05-07T20:32:42.4415987Z y_fp8, y_scale = fn() 2025-05-07T20:32:42.4416265Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:42.4416552Z 2025-05-07T20:32:42.4416788Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4417116Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:42.4417407Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:42.4417724Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:42.4418080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.4419696Z 2025-05-07T20:32:42.4419897Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.4420090Z 2025-05-07T20:32:42.4420199Z moe/activation_test.py:126: 2025-05-07T20:32:42.4420492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4420835Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.4421161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.4421944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.4422702Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.4423252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4423933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4424624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.4425354Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.4426092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.4426853Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.4427458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.4427976Z fn() 2025-05-07T20:32:42.4428488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.4429063Z self.fn.run( 2025-05-07T20:32:42.4429536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4430073Z kernel = self.compile( 2025-05-07T20:32:42.4430613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4431269Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4431673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4431910Z 2025-05-07T20:32:42.4432128Z self = 2025-05-07T20:32:42.4433203Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4434610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085a4d800>} 2025-05-07T20:32:42.4436037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4437065Z context = 2025-05-07T20:32:42.4437359Z 2025-05-07T20:32:42.4437529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4438045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4438518Z module_map=module_map) 2025-05-07T20:32:42.4438891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4439423Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.4439684Z E ^ 2025-05-07T20:32:42.4440156Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4440603Z 2025-05-07T20:32:42.4441031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.4441540Z 2025-05-07T20:32:42.4441645Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.4442060Z self=, 2025-05-07T20:32:42.4442470Z T=16384, 2025-05-07T20:32:42.4442668Z D=7168, 2025-05-07T20:32:42.4442856Z scale_ub=1200.0, 2025-05-07T20:32:42.4443085Z contiguous=False, 2025-05-07T20:32:42.4443315Z compiled=False, 2025-05-07T20:32:42.4443514Z ) 2025-05-07T20:32:42.4443861Z self = 2025-05-07T20:32:42.4444437Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.4444715Z 2025-05-07T20:32:42.4444792Z @given( 2025-05-07T20:32:42.4445029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.4445352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.4445665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.4446005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.4446345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.4446639Z ) 2025-05-07T20:32:42.4447079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.4447531Z def test_silu_mul_quant( 2025-05-07T20:32:42.4447787Z self, 2025-05-07T20:32:42.4447982Z T: int, 2025-05-07T20:32:42.4448197Z D: int, 2025-05-07T20:32:42.4448429Z scale_ub: Optional[float], 2025-05-07T20:32:42.4448703Z contiguous: bool, 2025-05-07T20:32:42.4448952Z compiled: bool, 2025-05-07T20:32:42.4449192Z ) -> None: 2025-05-07T20:32:42.4449404Z torch.manual_seed(2025) 2025-05-07T20:32:42.4449655Z 2025-05-07T20:32:42.4449935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.4450287Z 2025-05-07T20:32:42.4450572Z x_sign = torch.sign(x) 2025-05-07T20:32:42.4450952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.4451270Z x = x_sign * x_clamp 2025-05-07T20:32:42.4451504Z x0 = x[:, :D] 2025-05-07T20:32:42.4451728Z x1 = x[:, D:] 2025-05-07T20:32:42.4451939Z 2025-05-07T20:32:42.4452119Z if contiguous: 2025-05-07T20:32:42.4452357Z x0 = x0.contiguous() 2025-05-07T20:32:42.4452650Z x1 = x1.contiguous() 2025-05-07T20:32:42.4452902Z 2025-05-07T20:32:42.4453097Z if scale_ub is not None: 2025-05-07T20:32:42.4453368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.4453695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.4454005Z ) 2025-05-07T20:32:42.4454204Z else: 2025-05-07T20:32:42.4454410Z scale_ub_tensor = None 2025-05-07T20:32:42.4454665Z 2025-05-07T20:32:42.4455034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.4455349Z op = silu_mul_quant 2025-05-07T20:32:42.4455595Z if compiled: 2025-05-07T20:32:42.4455845Z op = torch.compile(op) 2025-05-07T20:32:42.4456135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4456418Z 2025-05-07T20:32:42.4456607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.4456771Z 2025-05-07T20:32:42.4456874Z moe/activation_test.py:117: 2025-05-07T20:32:42.4457169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4457503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.4457789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.4458470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.4459154Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.4459692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.4460370Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.4461022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.4461552Z kernel = self.compile( 2025-05-07T20:32:42.4462095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.4462745Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.4463145Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.4463384Z 2025-05-07T20:32:42.4463592Z self = 2025-05-07T20:32:42.4464680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.4466047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa0859de980>} 2025-05-07T20:32:42.4467467Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.4468489Z context = 2025-05-07T20:32:42.4468788Z 2025-05-07T20:32:42.4468959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.4469480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.4469940Z module_map=module_map) 2025-05-07T20:32:42.4470312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.4470672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.4470933Z E ^ 2025-05-07T20:32:42.4471400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.4471861Z 2025-05-07T20:32:42.4472272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.6720142Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.6721285Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:42.6723107Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.6724721Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.6725755Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6727080Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.6728487Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6729826Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6731240Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6732313Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:42.6733594Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.6743545Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:42.6744457Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:42.6745684Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.6747158Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:42.6748208Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:42.6749300Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:42.6750517Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.6751805Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.6752732Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:42.6753850Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:42.6754918Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:42.6755791Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:42.6756974Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.6758346Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.6759415Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6760329Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6761092Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:42.6762124Z W0507 20:32:42.668000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.7274713Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.7275867Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:42.7307811Z W0507 20:32:42.723000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1809331Z 2025-05-07T20:32:43.1809675Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1810423Z self=, 2025-05-07T20:32:43.1811064Z T=1, 2025-05-07T20:32:43.1811382Z D=7168, 2025-05-07T20:32:43.1811665Z scale_ub=None, 2025-05-07T20:32:43.1811955Z contiguous=True, 2025-05-07T20:32:43.1812252Z compiled=True, 2025-05-07T20:32:43.1812543Z ) 2025-05-07T20:32:43.1812960Z self = 2025-05-07T20:32:43.1813477Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1813745Z 2025-05-07T20:32:43.1813821Z @given( 2025-05-07T20:32:43.1814050Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1814353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1814659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1815320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1815651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1815923Z ) 2025-05-07T20:32:43.1816268Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1816709Z def test_silu_mul_quant( 2025-05-07T20:32:43.1816941Z self, 2025-05-07T20:32:43.1817131Z T: int, 2025-05-07T20:32:43.1817323Z D: int, 2025-05-07T20:32:43.1817530Z scale_ub: Optional[float], 2025-05-07T20:32:43.1817797Z contiguous: bool, 2025-05-07T20:32:43.1818031Z compiled: bool, 2025-05-07T20:32:43.1818249Z ) -> None: 2025-05-07T20:32:43.1818461Z torch.manual_seed(2025) 2025-05-07T20:32:43.1818698Z 2025-05-07T20:32:43.1818959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1819297Z 2025-05-07T20:32:43.1819485Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1819767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1820072Z x = x_sign * x_clamp 2025-05-07T20:32:43.1820308Z x0 = x[:, :D] 2025-05-07T20:32:43.1820520Z x1 = x[:, D:] 2025-05-07T20:32:43.1820715Z 2025-05-07T20:32:43.1820897Z if contiguous: 2025-05-07T20:32:43.1821129Z x0 = x0.contiguous() 2025-05-07T20:32:43.1821382Z x1 = x1.contiguous() 2025-05-07T20:32:43.1821615Z 2025-05-07T20:32:43.1821801Z if scale_ub is not None: 2025-05-07T20:32:43.1822063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1822393Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1822709Z ) 2025-05-07T20:32:43.1822921Z else: 2025-05-07T20:32:43.1823134Z scale_ub_tensor = None 2025-05-07T20:32:43.1823380Z 2025-05-07T20:32:43.1823600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1823911Z op = silu_mul_quant 2025-05-07T20:32:43.1824159Z if compiled: 2025-05-07T20:32:43.1824397Z op = torch.compile(op) 2025-05-07T20:32:43.1824687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1824955Z 2025-05-07T20:32:43.1825136Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1825589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1825876Z 2025-05-07T20:32:43.1826109Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1826433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1826722Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1827032Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1827379Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1827686Z 2025-05-07T20:32:43.1827885Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1828074Z 2025-05-07T20:32:43.1828173Z moe/activation_test.py:126: 2025-05-07T20:32:43.1828471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1828808Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1829133Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1829907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1830654Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1831196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1831863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1832542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1833506Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1834494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1835121Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1835708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1836222Z fn() 2025-05-07T20:32:43.1836719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1837287Z self.fn.run( 2025-05-07T20:32:43.1837746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1838267Z kernel = self.compile( 2025-05-07T20:32:43.1838795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1839443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1839839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1840066Z 2025-05-07T20:32:43.1840275Z self = 2025-05-07T20:32:43.1841342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1842725Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa085a05e40>} 2025-05-07T20:32:43.1844050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1845193Z context = 2025-05-07T20:32:43.1845476Z 2025-05-07T20:32:43.1845648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1846156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1846700Z module_map=module_map) 2025-05-07T20:32:43.1847058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1847403Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1847667Z E ^ 2025-05-07T20:32:43.1848123Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1848564Z 2025-05-07T20:32:43.1848979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1849481Z 2025-05-07T20:32:43.1849580Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1849996Z self=, 2025-05-07T20:32:43.1850392Z T=4096, 2025-05-07T20:32:43.1850569Z D=5120, 2025-05-07T20:32:43.1850758Z scale_ub=None, 2025-05-07T20:32:43.1850972Z contiguous=False, 2025-05-07T20:32:43.1851187Z compiled=False, 2025-05-07T20:32:43.1851396Z ) 2025-05-07T20:32:43.1851712Z self = 2025-05-07T20:32:43.1852191Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1852464Z 2025-05-07T20:32:43.1852539Z @given( 2025-05-07T20:32:43.1852770Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1853126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1853422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1853749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1854080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1854435Z ) 2025-05-07T20:32:43.1854779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1855214Z def test_silu_mul_quant( 2025-05-07T20:32:43.1855439Z self, 2025-05-07T20:32:43.1855625Z T: int, 2025-05-07T20:32:43.1855820Z D: int, 2025-05-07T20:32:43.1856031Z scale_ub: Optional[float], 2025-05-07T20:32:43.1856289Z contiguous: bool, 2025-05-07T20:32:43.1856523Z compiled: bool, 2025-05-07T20:32:43.1856739Z ) -> None: 2025-05-07T20:32:43.1856949Z torch.manual_seed(2025) 2025-05-07T20:32:43.1857181Z 2025-05-07T20:32:43.1857443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1857767Z 2025-05-07T20:32:43.1857955Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1858242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1858535Z x = x_sign * x_clamp 2025-05-07T20:32:43.1858770Z x0 = x[:, :D] 2025-05-07T20:32:43.1858984Z x1 = x[:, D:] 2025-05-07T20:32:43.1859178Z 2025-05-07T20:32:43.1859362Z if contiguous: 2025-05-07T20:32:43.1859585Z x0 = x0.contiguous() 2025-05-07T20:32:43.1859829Z x1 = x1.contiguous() 2025-05-07T20:32:43.1860063Z 2025-05-07T20:32:43.1860247Z if scale_ub is not None: 2025-05-07T20:32:43.1860504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1860831Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1861132Z ) 2025-05-07T20:32:43.1861319Z else: 2025-05-07T20:32:43.1861519Z scale_ub_tensor = None 2025-05-07T20:32:43.1861759Z 2025-05-07T20:32:43.1861985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1862290Z op = silu_mul_quant 2025-05-07T20:32:43.1862533Z if compiled: 2025-05-07T20:32:43.1862775Z op = torch.compile(op) 2025-05-07T20:32:43.1863065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1863330Z 2025-05-07T20:32:43.1863517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1863676Z 2025-05-07T20:32:43.1863775Z moe/activation_test.py:117: 2025-05-07T20:32:43.1864065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1864478Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1864752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1865425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1866102Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1866633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1867303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1867959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1868479Z kernel = self.compile( 2025-05-07T20:32:43.1869014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1869652Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1870051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1870274Z 2025-05-07T20:32:43.1870482Z self = 2025-05-07T20:32:43.1871546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1873064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084544a40>} 2025-05-07T20:32:43.1874398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1875413Z context = 2025-05-07T20:32:43.1875700Z 2025-05-07T20:32:43.1875866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1876373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1876831Z module_map=module_map) 2025-05-07T20:32:43.1877190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1877537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1877783Z E ^ 2025-05-07T20:32:43.1878244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1878686Z 2025-05-07T20:32:43.1879102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4697094Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:43.4698374Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:43.4699915Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:43.4701566Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:43.4702680Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:43.4704175Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:43.4706170Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4707665Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:43.4709547Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4710639Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:43.4711928Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:43.4713192Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:43.4714036Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.4715436Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:43.4716641Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:43.4717669Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:43.4718664Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:43.4719866Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:43.4721129Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:43.4722021Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.4723151Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:43.4724170Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:43.4725084Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:43.4726244Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:43.4727584Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:43.4728767Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4729665Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4730396Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:43.4731406Z W0507 20:32:43.465000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:43.6552209Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:43.6553337Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): [traceback identical to the identify_mutated_tensors warning above] 2025-05-07T20:32:43.6584921Z W0507 20:32:43.651000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1830620Z 2025-05-07T20:32:44.1831099Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1831855Z self=, 2025-05-07T20:32:44.1832514Z T=4096, 2025-05-07T20:32:44.1832826Z D=7168, 2025-05-07T20:32:44.1833162Z scale_ub=None, 2025-05-07T20:32:44.1833412Z contiguous=False, 2025-05-07T20:32:44.1833629Z compiled=False, 2025-05-07T20:32:44.1833842Z ) 2025-05-07T20:32:44.1834163Z self = 2025-05-07T20:32:44.1834663Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.1834939Z 2025-05-07T20:32:44.1835012Z @given( 2025-05-07T20:32:44.1835242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1835541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1835851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1836178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1836493Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1836772Z ) 2025-05-07T20:32:44.1837122Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1837555Z def test_silu_mul_quant( 2025-05-07T20:32:44.1837787Z self, 2025-05-07T20:32:44.1837985Z T: int, 2025-05-07T20:32:44.1838193Z D: int, 2025-05-07T20:32:44.1838407Z scale_ub: Optional[float], 2025-05-07T20:32:44.1839064Z contiguous: bool, 2025-05-07T20:32:44.1839308Z compiled: bool, 2025-05-07T20:32:44.1839542Z ) -> None: 2025-05-07T20:32:44.1839769Z torch.manual_seed(2025) 2025-05-07T20:32:44.1840008Z 2025-05-07T20:32:44.1840278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1840663Z 2025-05-07T20:32:44.1840849Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1841149Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1841466Z x = x_sign * x_clamp 2025-05-07T20:32:44.1841701Z x0 = x[:, :D] 2025-05-07T20:32:44.1841921Z x1 = x[:, D:] 2025-05-07T20:32:44.1842140Z 2025-05-07T20:32:44.1842319Z if contiguous: 2025-05-07T20:32:44.1842543Z x0 = x0.contiguous() 2025-05-07T20:32:44.1842798Z x1 = x1.contiguous() 2025-05-07T20:32:44.1843025Z 2025-05-07T20:32:44.1843213Z if scale_ub is not None: 2025-05-07T20:32:44.1843482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1843816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1844117Z ) 2025-05-07T20:32:44.1844400Z else: 2025-05-07T20:32:44.1844608Z scale_ub_tensor = None 2025-05-07T20:32:44.1844844Z 2025-05-07T20:32:44.1845070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1845376Z op = silu_mul_quant 2025-05-07T20:32:44.1845696Z if compiled: 2025-05-07T20:32:44.1846023Z op = torch.compile(op) 2025-05-07T20:32:44.1846421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1846776Z 2025-05-07T20:32:44.1846998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.1847348Z 2025-05-07T20:32:44.1847458Z moe/activation_test.py:117: 2025-05-07T20:32:44.1847748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1848077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.1848355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1849050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.1849730Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.1850263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.1850945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1851595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1852118Z kernel = self.compile( 2025-05-07T20:32:44.1852663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1853344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1853755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1853995Z 2025-05-07T20:32:44.1854202Z self = 2025-05-07T20:32:44.1855280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1856663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084546660>} 2025-05-07T20:32:44.1857993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1859008Z context = 2025-05-07T20:32:44.1859415Z 2025-05-07T20:32:44.1859576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1860111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1868164Z module_map=module_map) 2025-05-07T20:32:44.1868549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1868922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.1869200Z E ^ 2025-05-07T20:32:44.1869671Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1870136Z 2025-05-07T20:32:44.1870576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1871102Z 2025-05-07T20:32:44.1871212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1871640Z self=, 2025-05-07T20:32:44.1872053Z T=128, 2025-05-07T20:32:44.1872258Z D=7168, 2025-05-07T20:32:44.1872467Z scale_ub=None, 2025-05-07T20:32:44.1872687Z contiguous=False, 2025-05-07T20:32:44.1872925Z compiled=True, 2025-05-07T20:32:44.1873142Z ) 2025-05-07T20:32:44.1873467Z self = 2025-05-07T20:32:44.1873967Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.1874239Z 2025-05-07T20:32:44.1874333Z @given( 2025-05-07T20:32:44.1874588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1874908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1875353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1875700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1876029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1876328Z ) 2025-05-07T20:32:44.1876690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1877142Z def test_silu_mul_quant( 2025-05-07T20:32:44.1877403Z self, 2025-05-07T20:32:44.1877610Z T: int, 2025-05-07T20:32:44.1877810Z D: int, 2025-05-07T20:32:44.1878041Z scale_ub: Optional[float], 2025-05-07T20:32:44.1878331Z contiguous: bool, 2025-05-07T20:32:44.1878575Z compiled: bool, 2025-05-07T20:32:44.1880298Z ) -> None: 2025-05-07T20:32:44.1880535Z torch.manual_seed(2025) 2025-05-07T20:32:44.1880787Z 2025-05-07T20:32:44.1881091Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1881461Z 2025-05-07T20:32:44.1881691Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1881994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1882335Z x = x_sign * x_clamp 2025-05-07T20:32:44.1882605Z x0 = x[:, :D] 2025-05-07T20:32:44.1882838Z x1 = x[:, D:] 2025-05-07T20:32:44.1883084Z 2025-05-07T20:32:44.1883298Z if contiguous: 2025-05-07T20:32:44.1883546Z x0 = x0.contiguous() 2025-05-07T20:32:44.1883836Z x1 = x1.contiguous() 2025-05-07T20:32:44.1884104Z 2025-05-07T20:32:44.1884431Z if scale_ub is not None: 2025-05-07T20:32:44.1884731Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1885087Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1885408Z ) 2025-05-07T20:32:44.1885632Z else: 2025-05-07T20:32:44.1885869Z scale_ub_tensor = None 2025-05-07T20:32:44.1886133Z 2025-05-07T20:32:44.1886390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1886735Z op = silu_mul_quant 2025-05-07T20:32:44.1887012Z if compiled: 2025-05-07T20:32:44.1887266Z op = torch.compile(op) 2025-05-07T20:32:44.1887576Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1887862Z 2025-05-07T20:32:44.1888154Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.1888457Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.1888761Z 2025-05-07T20:32:44.1889001Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1889348Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.1889644Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.1889968Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.1890329Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1890650Z 2025-05-07T20:32:44.1890867Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.1891063Z 2025-05-07T20:32:44.1891174Z moe/activation_test.py:126: 2025-05-07T20:32:44.1891481Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1891839Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.1892173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1892966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.1893720Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.1894268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1894953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1895629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.1896432Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.1897162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.1897787Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.1898387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.1898900Z fn() 2025-05-07T20:32:44.1899411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.1899990Z self.fn.run( 2025-05-07T20:32:44.1900457Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1900983Z kernel = self.compile( 2025-05-07T20:32:44.1901514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1902171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1902568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1902798Z 2025-05-07T20:32:44.1903015Z self = 2025-05-07T20:32:44.1904096Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1905465Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa084545bc0>} 2025-05-07T20:32:44.1906852Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1907868Z context = 2025-05-07T20:32:44.1908153Z 2025-05-07T20:32:44.1908622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1909340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1910026Z module_map=module_map) 2025-05-07T20:32:44.1910397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1910746Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.1911016Z E ^ 2025-05-07T20:32:44.1911482Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1911927Z 2025-05-07T20:32:44.1912352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4294054Z 2025-05-07T20:32:44.4294633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4295364Z self=, 2025-05-07T20:32:44.4295976Z T=128, 2025-05-07T20:32:44.4296251Z D=7168, 2025-05-07T20:32:44.4296520Z scale_ub=None, 2025-05-07T20:32:44.4296840Z contiguous=False, 2025-05-07T20:32:44.4297098Z compiled=False, 2025-05-07T20:32:44.4297293Z ) 2025-05-07T20:32:44.4297607Z self = 2025-05-07T20:32:44.4298092Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.4298353Z 2025-05-07T20:32:44.4298424Z @given( 2025-05-07T20:32:44.4298652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4298959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4299251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4299575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4300295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4300580Z ) 2025-05-07T20:32:44.4300912Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4301345Z def test_silu_mul_quant( 2025-05-07T20:32:44.4301576Z self, 2025-05-07T20:32:44.4301758Z T: int, 2025-05-07T20:32:44.4301941Z D: int, 2025-05-07T20:32:44.4302149Z scale_ub: Optional[float], 2025-05-07T20:32:44.4302401Z contiguous: bool, 2025-05-07T20:32:44.4302630Z compiled: bool, 2025-05-07T20:32:44.4302848Z ) -> None: 2025-05-07T20:32:44.4303044Z torch.manual_seed(2025) 2025-05-07T20:32:44.4303276Z 2025-05-07T20:32:44.4303539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4303861Z 2025-05-07T20:32:44.4304088Z x_sign = torch.sign(x) 
2025-05-07T20:32:44.4304363Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4304666Z x = x_sign * x_clamp 2025-05-07T20:32:44.4304895Z x0 = x[:, :D] 2025-05-07T20:32:44.4305096Z x1 = x[:, D:] 2025-05-07T20:32:44.4305293Z 2025-05-07T20:32:44.4305468Z if contiguous: 2025-05-07T20:32:44.4305683Z x0 = x0.contiguous() 2025-05-07T20:32:44.4305939Z x1 = x1.contiguous() 2025-05-07T20:32:44.4306167Z 2025-05-07T20:32:44.4306343Z if scale_ub is not None: 2025-05-07T20:32:44.4306599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4306922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4307217Z ) 2025-05-07T20:32:44.4307395Z else: 2025-05-07T20:32:44.4307593Z scale_ub_tensor = None 2025-05-07T20:32:44.4307839Z 2025-05-07T20:32:44.4308054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4308651Z op = silu_mul_quant 2025-05-07T20:32:44.4308894Z if compiled: 2025-05-07T20:32:44.4309133Z op = torch.compile(op) 2025-05-07T20:32:44.4309420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4309686Z 2025-05-07T20:32:44.4309865Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.4310032Z 2025-05-07T20:32:44.4310125Z moe/activation_test.py:117: 2025-05-07T20:32:44.4310584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4310913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.4311183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4311864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.4312543Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.4313063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4313787Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4314449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4314967Z kernel = self.compile( 2025-05-07T20:32:44.4315492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4316148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4316542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4316764Z 2025-05-07T20:32:44.4316978Z self = 2025-05-07T20:32:44.4318046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4319542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fa07fca23e0>} 2025-05-07T20:32:44.4320875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4321898Z context = 2025-05-07T20:32:44.4322179Z 2025-05-07T20:32:44.4322340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4322858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4323315Z module_map=module_map) 2025-05-07T20:32:44.4323674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4324011Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.4324438Z E ^ 2025-05-07T20:32:44.4324899Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4325340Z 2025-05-07T20:32:44.4325751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4326264Z 2025-05-07T20:32:44.4326362Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4326768Z self=, 2025-05-07T20:32:44.4327159Z T=4096, 2025-05-07T20:32:44.4327330Z D=5120, 2025-05-07T20:32:44.4327511Z scale_ub=1200.0, 2025-05-07T20:32:44.4327725Z contiguous=True, 2025-05-07T20:32:44.4327931Z compiled=False, 2025-05-07T20:32:44.4328130Z ) 2025-05-07T20:32:44.4328443Z self = 2025-05-07T20:32:44.4328922Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4329196Z 2025-05-07T20:32:44.4329269Z @given( 2025-05-07T20:32:44.4329496Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4329799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4330090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4330413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4330858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4331126Z ) 2025-05-07T20:32:44.4331467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4331898Z def test_silu_mul_quant( 2025-05-07T20:32:44.4332124Z self, 2025-05-07T20:32:44.4332318Z T: int, 2025-05-07T20:32:44.4332508Z D: int, 2025-05-07T20:32:44.4332718Z scale_ub: Optional[float], 2025-05-07T20:32:44.4332981Z contiguous: bool, 2025-05-07T20:32:44.4333216Z compiled: bool, 2025-05-07T20:32:44.4333426Z ) -> None: 2025-05-07T20:32:44.4333635Z torch.manual_seed(2025) 2025-05-07T20:32:44.4333872Z 2025-05-07T20:32:44.4334130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4334467Z 2025-05-07T20:32:44.4334652Z x_sign = torch.sign(x) 2025-05-07T20:32:44.4334934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4335233Z x = x_sign * x_clamp 2025-05-07T20:32:44.4335462Z x0 = x[:, :D] 2025-05-07T20:32:44.4335667Z x1 = x[:, D:] 2025-05-07T20:32:44.4335857Z 2025-05-07T20:32:44.4336030Z if contiguous: 2025-05-07T20:32:44.4336252Z x0 = x0.contiguous() 2025-05-07T20:32:44.4336496Z x1 = x1.contiguous() 2025-05-07T20:32:44.4336723Z 2025-05-07T20:32:44.4336907Z if scale_ub is not None: 2025-05-07T20:32:44.4337162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4337488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4337788Z ) 2025-05-07T20:32:44.4337963Z else: 2025-05-07T20:32:44.4338278Z scale_ub_tensor = None 2025-05-07T20:32:44.4338524Z 2025-05-07T20:32:44.4338740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4339043Z op = silu_mul_quant 2025-05-07T20:32:44.4339288Z if compiled: 
2025-05-07T20:32:44.4339530Z op = torch.compile(op) 2025-05-07T20:32:44.4339810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4340071Z 2025-05-07T20:32:44.4340251Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.4340409Z 2025-05-07T20:32:44.4340502Z moe/activation_test.py:117: 2025-05-07T20:32:44.4340790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4341113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.4341381Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4342063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.4342738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.4343263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4343930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4344591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4345112Z kernel = self.compile( 2025-05-07T20:32:44.4345637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4346284Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4346673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4346897Z 2025-05-07T20:32:44.4347105Z self = 2025-05-07T20:32:44.4348187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4349544Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca2700>} 2025-05-07T20:32:44.4351006Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4352023Z context = 2025-05-07T20:32:44.4352308Z 2025-05-07T20:32:44.4352478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4352990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4353453Z module_map=module_map) 2025-05-07T20:32:44.4353810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4354154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.4354409Z E ^ 2025-05-07T20:32:44.4354863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4355308Z 2025-05-07T20:32:44.4355724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.4356227Z 2025-05-07T20:32:44.4356327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4356733Z self=, 2025-05-07T20:32:44.4357130Z T=1, 2025-05-07T20:32:44.4357305Z D=5120, 2025-05-07T20:32:44.4357482Z scale_ub=None, 2025-05-07T20:32:44.4357690Z contiguous=True, 2025-05-07T20:32:44.4357988Z compiled=True, 2025-05-07T20:32:44.4358178Z ) 2025-05-07T20:32:44.4358488Z self = 2025-05-07T20:32:44.4358961Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.4359218Z 2025-05-07T20:32:44.4359289Z @given( 2025-05-07T20:32:44.4359510Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4359812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4360101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4360423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4360742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4361017Z ) 2025-05-07T20:32:44.4361348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4361781Z def test_silu_mul_quant( 2025-05-07T20:32:44.4362010Z self, 2025-05-07T20:32:44.4362188Z T: int, 2025-05-07T20:32:44.4362382Z D: int, 2025-05-07T20:32:44.4362591Z scale_ub: Optional[float], 2025-05-07T20:32:44.4362844Z contiguous: bool, 2025-05-07T20:32:44.4363086Z compiled: bool, 2025-05-07T20:32:44.4363335Z ) -> None: 2025-05-07T20:32:44.4363543Z torch.manual_seed(2025) 2025-05-07T20:32:44.4363780Z 2025-05-07T20:32:44.4364042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4364472Z 2025-05-07T20:32:44.4364653Z x_sign = torch.sign(x) 2025-05-07T20:32:44.4364936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.4365228Z x = x_sign * x_clamp 2025-05-07T20:32:44.4365461Z x0 = x[:, :D] 2025-05-07T20:32:44.4365668Z x1 = x[:, D:] 2025-05-07T20:32:44.4365865Z 2025-05-07T20:32:44.4366031Z if contiguous: 2025-05-07T20:32:44.4366255Z x0 = x0.contiguous() 2025-05-07T20:32:44.4366507Z x1 = x1.contiguous() 2025-05-07T20:32:44.4366734Z 2025-05-07T20:32:44.4366916Z if scale_ub is not None: 2025-05-07T20:32:44.4367182Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.4367503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.4367809Z ) 2025-05-07T20:32:44.4368147Z else: 2025-05-07T20:32:44.4368342Z scale_ub_tensor = None 2025-05-07T20:32:44.4368590Z 2025-05-07T20:32:44.4368815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4369113Z op = silu_mul_quant 2025-05-07T20:32:44.4369360Z if compiled: 2025-05-07T20:32:44.4369603Z op = torch.compile(op) 2025-05-07T20:32:44.4369884Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.4370148Z 2025-05-07T20:32:44.4370334Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.4370613Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.4370890Z 2025-05-07T20:32:44.4371123Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.4371447Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.4371724Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.4372028Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.4372385Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.4372676Z 2025-05-07T20:32:44.4372869Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.4373058Z 2025-05-07T20:32:44.4373156Z moe/activation_test.py:126: 2025-05-07T20:32:44.4373440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4373766Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.4374083Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.4374856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.4375673Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.4376208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.4376884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.4377569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.4378273Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.4378991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.4379620Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.4380201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.4380707Z fn() 2025-05-07T20:32:44.4381209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.4381783Z self.fn.run( 2025-05-07T20:32:44.4382234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.4382755Z kernel = self.compile( 2025-05-07T20:32:44.4383285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.4383924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.4384312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.4384542Z 2025-05-07T20:32:44.4384746Z self = 2025-05-07T20:32:44.4385820Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.4387182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca3ba0>} 2025-05-07T20:32:44.4388590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.4389604Z context = 2025-05-07T20:32:44.4389898Z 2025-05-07T20:32:44.4390058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.4390574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.4391032Z module_map=module_map) 2025-05-07T20:32:44.4391401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.4391756Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.4392008Z E ^ 2025-05-07T20:32:44.4392461Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.4392920Z 2025-05-07T20:32:44.4393331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6591221Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.6592281Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:44.6594063Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.6603289Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.6604506Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.6605815Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.6607201Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6608758Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.6610140Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6611194Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] module_map=module_map) 2025-05-07T20:32:44.6612464Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.6613778Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.6614627Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:44.6615829Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.6617233Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:44.6618273Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:44.6619290Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:44.6620502Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.6621757Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.6622657Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:44.6623731Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:44.6624759Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:44.6625627Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:44.6626783Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.6628127Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.6629173Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6630076Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6630805Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:44.6631807Z W0507 20:32:44.655000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.2132998Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.2134136Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:32:45.2135510Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.2137059Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.2138475Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.2139840Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.2141219Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.2142510Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.2143879Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.2144935Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] module_map=module_map) 2025-05-07T20:32:45.2146200Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.2147444Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:32:45.2148277Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.2149743Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.2151024Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:32:45.2152050Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.2153281Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 
2025-05-07T20:32:45.2154482Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.2155749Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.2156660Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.2157733Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.2158771Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:32:45.2159528Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.2160685Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.2162107Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.2163163Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.2164067Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.2164949Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:32:45.2165960Z W0507 20:32:45.209000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5488388Z 2025-05-07T20:32:45.5489106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5489845Z self=, 2025-05-07T20:32:45.5490437Z T=2048, 2025-05-07T20:32:45.5490628Z D=5120, 2025-05-07T20:32:45.5490819Z scale_ub=None, 2025-05-07T20:32:45.5491026Z contiguous=True, 2025-05-07T20:32:45.5491246Z compiled=True, 2025-05-07T20:32:45.5491473Z ) 2025-05-07T20:32:45.5491785Z self = 2025-05-07T20:32:45.5492269Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5492541Z 2025-05-07T20:32:45.5492630Z @given( 2025-05-07T20:32:45.5492859Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5493166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5493476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5493809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5494132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5494421Z ) 2025-05-07T20:32:45.5494772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5495206Z def test_silu_mul_quant( 2025-05-07T20:32:45.5495453Z self, 2025-05-07T20:32:45.5495653Z T: int, 2025-05-07T20:32:45.5496237Z D: int, 2025-05-07T20:32:45.5496470Z scale_ub: Optional[float], 2025-05-07T20:32:45.5496735Z contiguous: bool, 2025-05-07T20:32:45.5496980Z compiled: bool, 2025-05-07T20:32:45.5497216Z ) -> None: 2025-05-07T20:32:45.5497429Z torch.manual_seed(2025) 2025-05-07T20:32:45.5497679Z 2025-05-07T20:32:45.5497951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5498283Z 2025-05-07T20:32:45.5498471Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5498761Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.5499064Z x = x_sign * x_clamp 2025-05-07T20:32:45.5499292Z x0 = x[:, :D] 2025-05-07T20:32:45.5499506Z x1 = x[:, D:] 2025-05-07T20:32:45.5499712Z 2025-05-07T20:32:45.5499885Z if contiguous: 2025-05-07T20:32:45.5500124Z x0 = x0.contiguous() 2025-05-07T20:32:45.5500383Z x1 = x1.contiguous() 2025-05-07T20:32:45.5500614Z 2025-05-07T20:32:45.5500807Z if scale_ub is not None: 2025-05-07T20:32:45.5501079Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5501407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5501713Z ) 2025-05-07T20:32:45.5501901Z else: 2025-05-07T20:32:45.5502112Z scale_ub_tensor = None 2025-05-07T20:32:45.5502360Z 2025-05-07T20:32:45.5502589Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5502890Z op = silu_mul_quant 2025-05-07T20:32:45.5503138Z if compiled: 2025-05-07T20:32:45.5503388Z op = torch.compile(op) 2025-05-07T20:32:45.5503675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5503947Z 2025-05-07T20:32:45.5504139Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5504424Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5504699Z 2025-05-07T20:32:45.5504934Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5505263Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5505540Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5505848Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5506200Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5507156Z 2025-05-07T20:32:45.5507357Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.5507553Z 2025-05-07T20:32:45.5507665Z moe/activation_test.py:126: 2025-05-07T20:32:45.5507956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5508564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5508891Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5509676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5510421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5510961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5511637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5512324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5513033Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5513756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5514387Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5514977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5515486Z fn() 2025-05-07T20:32:45.5516126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5516708Z self.fn.run( 2025-05-07T20:32:45.5517162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5517692Z kernel = self.compile( 2025-05-07T20:32:45.5518241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5518882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5519277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5519513Z 2025-05-07T20:32:45.5519717Z self = 2025-05-07T20:32:45.5520803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5522188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f33ec00>} 2025-05-07T20:32:45.5523518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5524676Z context = 2025-05-07T20:32:45.5524968Z 2025-05-07T20:32:45.5525132Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5525651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5526106Z module_map=module_map) 2025-05-07T20:32:45.5526470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5526842Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5527099Z E ^ 2025-05-07T20:32:45.5527558Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5528014Z 2025-05-07T20:32:45.5528608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.5529118Z 2025-05-07T20:32:45.5529227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.5529628Z self=, 2025-05-07T20:32:45.5530031Z T=128, 2025-05-07T20:32:45.5530213Z D=5120, 2025-05-07T20:32:45.5530392Z scale_ub=None, 2025-05-07T20:32:45.5530606Z contiguous=True, 2025-05-07T20:32:45.5530830Z compiled=True, 2025-05-07T20:32:45.5531020Z ) 2025-05-07T20:32:45.5531341Z self = 2025-05-07T20:32:45.5531840Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.5532101Z 2025-05-07T20:32:45.5532188Z @given( 2025-05-07T20:32:45.5532415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.5532731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.5533051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.5533375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.5533711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.5534000Z ) 2025-05-07T20:32:45.5534350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.5534797Z def test_silu_mul_quant( 2025-05-07T20:32:45.5535039Z self, 2025-05-07T20:32:45.5535229Z T: int, 2025-05-07T20:32:45.5535417Z D: int, 2025-05-07T20:32:45.5535630Z scale_ub: Optional[float], 2025-05-07T20:32:45.5535897Z contiguous: bool, 2025-05-07T20:32:45.5536213Z compiled: bool, 2025-05-07T20:32:45.5536437Z ) -> None: 2025-05-07T20:32:45.5536646Z torch.manual_seed(2025) 2025-05-07T20:32:45.5536877Z 2025-05-07T20:32:45.5537155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.5537495Z 2025-05-07T20:32:45.5537680Z x_sign = torch.sign(x) 2025-05-07T20:32:45.5537967Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:45.5538274Z x = x_sign * x_clamp 2025-05-07T20:32:45.5538501Z x0 = x[:, :D] 2025-05-07T20:32:45.5538716Z x1 = x[:, D:] 2025-05-07T20:32:45.5538923Z 2025-05-07T20:32:45.5539102Z if contiguous: 2025-05-07T20:32:45.5539330Z x0 = x0.contiguous() 2025-05-07T20:32:45.5539591Z x1 = x1.contiguous() 2025-05-07T20:32:45.5539822Z 2025-05-07T20:32:45.5540043Z if scale_ub is not None: 2025-05-07T20:32:45.5540326Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.5540665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.5540961Z ) 2025-05-07T20:32:45.5541151Z else: 2025-05-07T20:32:45.5541362Z scale_ub_tensor = None 2025-05-07T20:32:45.5541601Z 2025-05-07T20:32:45.5541829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5542149Z op = silu_mul_quant 2025-05-07T20:32:45.5542388Z if compiled: 2025-05-07T20:32:45.5542634Z op = torch.compile(op) 2025-05-07T20:32:45.5542930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.5543196Z 2025-05-07T20:32:45.5543383Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.5543665Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.5543940Z 2025-05-07T20:32:45.5544172Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.5544503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.5544783Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.5545095Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.5545445Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5545747Z 2025-05-07T20:32:45.5545937Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:45.5546226Z 2025-05-07T20:32:45.5546323Z moe/activation_test.py:126: 2025-05-07T20:32:45.5546615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5546937Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.5547258Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.5548035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.5548780Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.5549310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.5549999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.5550680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.5551397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.5552119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.5552752Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.5553347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.5553852Z fn() 2025-05-07T20:32:45.5554353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.5554929Z self.fn.run( 2025-05-07T20:32:45.5555478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.5555997Z kernel = self.compile( 2025-05-07T20:32:45.5556539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.5557191Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.5557575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.5557813Z 2025-05-07T20:32:45.5558017Z self = 2025-05-07T20:32:45.5559093Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.5560469Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f34cfe0>} 2025-05-07T20:32:45.5561807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.5562823Z context = 2025-05-07T20:32:45.5563117Z 2025-05-07T20:32:45.5563280Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.5563806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.5564359Z module_map=module_map) 2025-05-07T20:32:45.5564712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.5565060Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.5565321Z E ^ 2025-05-07T20:32:45.5565779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.5566233Z 2025-05-07T20:32:45.5566643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.7823839Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.7824983Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:45.7826329Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.7827792Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.7828778Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.7830091Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.7831469Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.7833145Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.7834517Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.7835613Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:45.7836875Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.7838101Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:45.7838929Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.7840113Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.7841298Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:45.7842323Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.7843325Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:45.7844674Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.7845924Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.7846814Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.7848055Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.7849074Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:45.7849825Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.7850980Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.7852322Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.7853377Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.7854278Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.7854997Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:45.7856129Z W0507 20:32:45.778000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.8444700Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:45.8445820Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:45.8447170Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:45.8448584Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:45.8449554Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:45.8450847Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:45.8452216Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.8453509Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:45.8454876Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.8455904Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:45.8457151Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:45.8458743Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:45.8459580Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.8460770Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:45.8461954Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:45.8462979Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:45.8464045Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 
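The repeated CompilationError above has a single root cause: Triton's fp8e4nv type (e4m3, torch.float8_e4m3fn) is only implemented for NVIDIA GPUs of compute capability 8.9 and newer; on older architectures Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. Every compile of _kernel_quantize_fp8_row or _fbgemm_silu_mul_quant on this GPU therefore fails the same way. A minimal sketch of a capability guard that could skip these cases up front (the helper and class names below are stand-ins, not FBGEMM code):

    import unittest

    import torch

    def sm89_or_newer() -> bool:
        # Triton's fp8e4nv (e4m3) conversions require compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class in moe/activation_test.py, this would turn the
    # repeated CompilationError into a clean skip on pre-sm_89 GPUs.
    @unittest.skipIf(not sm89_or_newer(), "Triton fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...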
2025-05-07T20:32:45.8465244Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:45.8466499Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:45.8467536Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:45.8468613Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:45.8469647Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:45.8470424Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:45.8481350Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:45.8482748Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:45.8483826Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.8484875Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.8485617Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:45.8486646Z W0507 20:32:45.840000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.3879144Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.3880219Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:46.3881551Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.3883426Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.3884500Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.3885803Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.3887179Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.3888478Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.3889845Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.3890950Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:46.3892388Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.3893624Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.3894466Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.3895644Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.3896896Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:46.3897923Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:46.3898939Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 
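The W0507 records above are warnings rather than test failures: when torch.compile traces a call into a user-defined Triton kernel, identify_mutated_tensors compiles the kernel to TTIR to determine which arguments it writes to; that compilation hits the same fp8e4nv error, so Dynamo logs the exception, conservatively assumes every input is mutated, and carries on. The underlying error can be reproduced outside FBGEMM in a few lines (a sketch assuming a CUDA GPU older than sm_89 and a Triton build with torch fp8 dtype support; the kernel is illustrative, not the FBGEMM one):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast to tl.float8e4nv is what make_ir rejects on pre-sm_89 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_fp8_kernel[(1,)](x, y, 128, BLOCK=128)  # raises the CompilationError above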
2025-05-07T20:32:46.3900148Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.3901417Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.3902312Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.3903392Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:46.3904470Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:46.3905313Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:46.3906471Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.3907817Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.3909258Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.3910165Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.3910893Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:46.3911909Z W0507 20:32:46.384000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.4507827Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.4510248Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:46.4512011Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.4513474Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.4514494Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.4515813Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.4517202Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.4518522Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.4519891Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.4520951Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:46.4522223Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.4523469Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.4524444Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.4525810Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.4527020Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:46.4528066Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:46.4529092Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 
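The same warning repeats once per Dynamo recompile attempt (the [0/6], [0/7] prefixes). For context, the kernel being compiled, _kernel_quantize_fp8_row, implements row-wise FP8 quantization: each row of y is scaled so that its maximum absolute value (optionally clamped by scale_ub) maps onto the FP8 range, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that scheme (the epsilon and clamping details are assumptions; the real kernel lives in triton_gemm/fp8_gemm.py):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a 1-element tensor
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / fp8_max  # one scale per row
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale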
2025-05-07T20:32:46.4530303Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.4531599Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.4532551Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:46.4533630Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:46.4534815Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:46.4535572Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:46.4536743Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.4538103Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.4539163Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.4540067Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.4540809Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:46.4541829Z W0507 20:32:46.447000 88618 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7615043Z 2025-05-07T20:32:46.7615462Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7615899Z self=, 2025-05-07T20:32:46.7616348Z T=4096, 2025-05-07T20:32:46.7616535Z D=5120, 2025-05-07T20:32:46.7616718Z scale_ub=None, 2025-05-07T20:32:46.7616919Z contiguous=True, 2025-05-07T20:32:46.7617141Z compiled=True, 2025-05-07T20:32:46.7617340Z ) 2025-05-07T20:32:46.7617645Z self = 2025-05-07T20:32:46.7618142Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7618408Z 2025-05-07T20:32:46.7618480Z @given( 2025-05-07T20:32:46.7618704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7619004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7619307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7619803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7620118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7620394Z ) 2025-05-07T20:32:46.7620735Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7621161Z def test_silu_mul_quant( 2025-05-07T20:32:46.7621395Z self, 2025-05-07T20:32:46.7621586Z T: int, 2025-05-07T20:32:46.7621768Z D: int, 2025-05-07T20:32:46.7621978Z scale_ub: Optional[float], 2025-05-07T20:32:46.7622241Z contiguous: bool, 2025-05-07T20:32:46.7622477Z compiled: bool, 2025-05-07T20:32:46.7622692Z ) -> None: 2025-05-07T20:32:46.7622902Z torch.manual_seed(2025) 2025-05-07T20:32:46.7623140Z 2025-05-07T20:32:46.7623397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7623733Z 2025-05-07T20:32:46.7623917Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7624201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7624502Z x = x_sign * x_clamp 2025-05-07T20:32:46.7624734Z x0 = x[:, :D] 2025-05-07T20:32:46.7624932Z x1 = x[:, D:] 2025-05-07T20:32:46.7625128Z 2025-05-07T20:32:46.7625298Z if contiguous: 2025-05-07T20:32:46.7625510Z x0 = x0.contiguous() 2025-05-07T20:32:46.7625760Z x1 = x1.contiguous() 2025-05-07T20:32:46.7625988Z 2025-05-07T20:32:46.7626163Z if scale_ub is not None: 2025-05-07T20:32:46.7626427Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7626753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7627175Z ) 2025-05-07T20:32:46.7627359Z else: 2025-05-07T20:32:46.7627559Z scale_ub_tensor = None 2025-05-07T20:32:46.7627800Z 2025-05-07T20:32:46.7628016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7628324Z op = silu_mul_quant 2025-05-07T20:32:46.7628566Z if compiled: 2025-05-07T20:32:46.7628799Z op = torch.compile(op) 2025-05-07T20:32:46.7629087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7629349Z 2025-05-07T20:32:46.7629522Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7629800Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7630082Z 2025-05-07T20:32:46.7630301Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7630628Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7630914Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7631221Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7631578Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7631881Z 2025-05-07T20:32:46.7632073Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7632262Z 2025-05-07T20:32:46.7632361Z moe/activation_test.py:126: 2025-05-07T20:32:46.7632658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7632995Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7633310Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7634098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7634843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7635387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7636066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7636746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7637463Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7638286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7638910Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7639502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7640009Z fn() 2025-05-07T20:32:46.7640500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7641066Z self.fn.run( 2025-05-07T20:32:46.7641532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7642053Z kernel = self.compile( 2025-05-07T20:32:46.7642577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7643221Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7643617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7643841Z 2025-05-07T20:32:46.7644043Z self = 2025-05-07T20:32:46.7645213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7646658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6baca0>} 2025-05-07T20:32:46.7647986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7648997Z context = 2025-05-07T20:32:46.7649280Z 2025-05-07T20:32:46.7649440Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7649953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7650412Z module_map=module_map) 2025-05-07T20:32:46.7650769Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7651109Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7651363Z E ^ 2025-05-07T20:32:46.7651823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7652266Z 2025-05-07T20:32:46.7652676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7653185Z 2025-05-07T20:32:46.7653281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7653695Z self=, 2025-05-07T20:32:46.7654085Z T=16384, 2025-05-07T20:32:46.7654262Z D=5120, 2025-05-07T20:32:46.7654447Z scale_ub=None, 2025-05-07T20:32:46.7654655Z contiguous=True, 2025-05-07T20:32:46.7654862Z compiled=True, 2025-05-07T20:32:46.7655054Z ) 2025-05-07T20:32:46.7655363Z self = 2025-05-07T20:32:46.7655845Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7656117Z 2025-05-07T20:32:46.7656188Z @given( 2025-05-07T20:32:46.7656422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7656722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7657024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7657352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7657675Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7658067Z ) 2025-05-07T20:32:46.7658409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7658843Z def test_silu_mul_quant( 2025-05-07T20:32:46.7659068Z self, 2025-05-07T20:32:46.7659254Z T: int, 2025-05-07T20:32:46.7659443Z D: int, 2025-05-07T20:32:46.7659645Z scale_ub: Optional[float], 2025-05-07T20:32:46.7659909Z contiguous: bool, 2025-05-07T20:32:46.7660142Z compiled: bool, 2025-05-07T20:32:46.7660346Z ) -> None: 2025-05-07T20:32:46.7660557Z torch.manual_seed(2025) 2025-05-07T20:32:46.7660787Z 2025-05-07T20:32:46.7661046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7661377Z 2025-05-07T20:32:46.7661559Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7661833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7662131Z x = x_sign * x_clamp 2025-05-07T20:32:46.7662366Z x0 = x[:, :D] 2025-05-07T20:32:46.7662574Z x1 = x[:, D:] 2025-05-07T20:32:46.7662764Z 2025-05-07T20:32:46.7662940Z if contiguous: 2025-05-07T20:32:46.7663159Z x0 = x0.contiguous() 2025-05-07T20:32:46.7663409Z x1 = x1.contiguous() 2025-05-07T20:32:46.7663646Z 2025-05-07T20:32:46.7663826Z if scale_ub is not None: 2025-05-07T20:32:46.7664083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7664408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7664702Z ) 2025-05-07T20:32:46.7664877Z else: 2025-05-07T20:32:46.7665074Z scale_ub_tensor = None 2025-05-07T20:32:46.7665398Z 2025-05-07T20:32:46.7665616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7665917Z op = silu_mul_quant 2025-05-07T20:32:46.7666161Z if compiled: 2025-05-07T20:32:46.7666392Z op = torch.compile(op) 2025-05-07T20:32:46.7666685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7666948Z 2025-05-07T20:32:46.7667127Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7667394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7667677Z 2025-05-07T20:32:46.7667903Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7668224Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7668509Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7668813Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7669153Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7669458Z 2025-05-07T20:32:46.7669649Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7669837Z 2025-05-07T20:32:46.7669929Z moe/activation_test.py:126: 2025-05-07T20:32:46.7670221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7670549Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7670866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7671633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7672371Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7672908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7673578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7674254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7674964Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7675680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7676393Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7676977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7677481Z fn() 2025-05-07T20:32:46.7677978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7678542Z self.fn.run( 2025-05-07T20:32:46.7679000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7679516Z kernel = self.compile( 2025-05-07T20:32:46.7680050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7680685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7681072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7681301Z 2025-05-07T20:32:46.7681508Z self = 2025-05-07T20:32:46.7682573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7683917Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6b9b20>} 2025-05-07T20:32:46.7685398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7686460Z context = 2025-05-07T20:32:46.7686745Z 2025-05-07T20:32:46.7686918Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7687423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7687876Z module_map=module_map) 2025-05-07T20:32:46.7688231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7688572Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7688819Z E ^ 2025-05-07T20:32:46.7689267Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7689707Z 2025-05-07T20:32:46.7690125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7897923Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:46.7899150Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:46.7900470Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:46.7901504Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:46.7902598Z W0507 20:32:46.788000 88618 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:47.2669569Z 2025-05-07T20:32:47.2669915Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2670356Z self=, 2025-05-07T20:32:47.2671040Z T=1, 2025-05-07T20:32:47.2671275Z D=5120, 2025-05-07T20:32:47.2671474Z scale_ub=1200.0, 2025-05-07T20:32:47.2671687Z contiguous=True, 2025-05-07T20:32:47.2671905Z compiled=True, 2025-05-07T20:32:47.2672117Z ) 2025-05-07T20:32:47.2672432Z self = 2025-05-07T20:32:47.2672924Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.2673177Z 2025-05-07T20:32:47.2673257Z @given( 2025-05-07T20:32:47.2673472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2673776Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2674084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2674405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2674718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2674996Z ) 2025-05-07T20:32:47.2675337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2675770Z def test_silu_mul_quant( 2025-05-07T20:32:47.2676007Z self, 2025-05-07T20:32:47.2676193Z T: int, 2025-05-07T20:32:47.2676372Z D: int, 2025-05-07T20:32:47.2676582Z scale_ub: Optional[float], 2025-05-07T20:32:47.2676841Z contiguous: bool, 2025-05-07T20:32:47.2677066Z compiled: bool, 2025-05-07T20:32:47.2677282Z ) -> None: 2025-05-07T20:32:47.2677488Z torch.manual_seed(2025) 2025-05-07T20:32:47.2677714Z 2025-05-07T20:32:47.2677975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2678306Z 2025-05-07T20:32:47.2678484Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2678905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2679215Z x = x_sign * x_clamp 2025-05-07T20:32:47.2679445Z x0 = x[:, :D] 2025-05-07T20:32:47.2679643Z x1 = x[:, D:] 2025-05-07T20:32:47.2679838Z 2025-05-07T20:32:47.2680018Z if contiguous: 2025-05-07T20:32:47.2680233Z x0 = x0.contiguous() 2025-05-07T20:32:47.2680481Z x1 = x1.contiguous() 2025-05-07T20:32:47.2680705Z 2025-05-07T20:32:47.2680879Z if scale_ub is not None: 2025-05-07T20:32:47.2681142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2681470Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:47.2681762Z ) 2025-05-07T20:32:47.2681946Z else: 2025-05-07T20:32:47.2682146Z scale_ub_tensor = None 2025-05-07T20:32:47.2682380Z 2025-05-07T20:32:47.2682604Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2682912Z op = silu_mul_quant 2025-05-07T20:32:47.2683148Z if compiled: 2025-05-07T20:32:47.2683386Z op = torch.compile(op) 2025-05-07T20:32:47.2683671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2683938Z 2025-05-07T20:32:47.2684115Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.2684406Z 2025-05-07T20:32:47.2684500Z moe/activation_test.py:117: 2025-05-07T20:32:47.2684791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2685108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.2685383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2685937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.2686485Z return fn(*args, **kwargs) 2025-05-07T20:32:47.2687133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.2687815Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.2688339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2689004Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2689925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2690454Z kernel = self.compile( 2025-05-07T20:32:47.2690982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2691637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2692029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2692252Z 2025-05-07T20:32:47.2692458Z self = 2025-05-07T20:32:47.2693521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2694934Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f015080>} 2025-05-07T20:32:47.2696270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2697282Z context = 2025-05-07T20:32:47.2697566Z 2025-05-07T20:32:47.2697734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2698244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2698831Z module_map=module_map) 2025-05-07T20:32:47.2699200Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2699543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.2699800Z E ^ 2025-05-07T20:32:47.2700266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2700709Z 2025-05-07T20:32:47.2701127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.2701632Z 2025-05-07T20:32:47.2701733Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.2702145Z self=, 2025-05-07T20:32:47.2702554Z T=1, 2025-05-07T20:32:47.2702740Z D=5120, 2025-05-07T20:32:47.2702933Z scale_ub=None, 2025-05-07T20:32:47.2703144Z contiguous=False, 2025-05-07T20:32:47.2703378Z compiled=True, 2025-05-07T20:32:47.2711281Z ) 2025-05-07T20:32:47.2711625Z self = 2025-05-07T20:32:47.2712122Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.2712386Z 2025-05-07T20:32:47.2712473Z @given( 2025-05-07T20:32:47.2712708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.2713023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.2713326Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.2713659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.2713986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.2714261Z ) 2025-05-07T20:32:47.2714612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.2715060Z def test_silu_mul_quant( 2025-05-07T20:32:47.2715304Z self, 2025-05-07T20:32:47.2715492Z T: int, 2025-05-07T20:32:47.2715698Z D: int, 2025-05-07T20:32:47.2715920Z scale_ub: Optional[float], 2025-05-07T20:32:47.2716185Z contiguous: bool, 2025-05-07T20:32:47.2716429Z compiled: bool, 2025-05-07T20:32:47.2716655Z ) -> None: 2025-05-07T20:32:47.2716865Z torch.manual_seed(2025) 2025-05-07T20:32:47.2717288Z 2025-05-07T20:32:47.2717563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.2717895Z 2025-05-07T20:32:47.2718092Z x_sign = torch.sign(x) 2025-05-07T20:32:47.2718385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.2718689Z x = x_sign * x_clamp 2025-05-07T20:32:47.2718929Z x0 = x[:, :D] 2025-05-07T20:32:47.2719146Z x1 = x[:, D:] 2025-05-07T20:32:47.2719345Z 2025-05-07T20:32:47.2719531Z if contiguous: 2025-05-07T20:32:47.2719767Z x0 = x0.contiguous() 2025-05-07T20:32:47.2720026Z x1 = x1.contiguous() 2025-05-07T20:32:47.2720257Z 2025-05-07T20:32:47.2720456Z if scale_ub is not None: 2025-05-07T20:32:47.2720728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.2721055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.2721362Z ) 2025-05-07T20:32:47.2721560Z else: 2025-05-07T20:32:47.2721765Z scale_ub_tensor = None 2025-05-07T20:32:47.2722017Z 2025-05-07T20:32:47.2722243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2722549Z op = silu_mul_quant 2025-05-07T20:32:47.2722794Z if compiled: 2025-05-07T20:32:47.2723041Z op = torch.compile(op) 2025-05-07T20:32:47.2723328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.2723602Z 2025-05-07T20:32:47.2723799Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.2724081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.2724559Z 2025-05-07T20:32:47.2724917Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.2725252Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.2725530Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.2725834Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.2726189Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2726488Z 2025-05-07T20:32:47.2726687Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.2726876Z 2025-05-07T20:32:47.2726977Z moe/activation_test.py:126: 2025-05-07T20:32:47.2727268Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2727600Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.2727921Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.2728705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.2729448Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.2729989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.2730668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.2731343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.2732064Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.2732775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.2733404Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.2733991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.2734496Z fn() 2025-05-07T20:32:47.2735001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.2735577Z self.fn.run( 2025-05-07T20:32:47.2736038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.2736638Z kernel = self.compile( 2025-05-07T20:32:47.2737172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.2737809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.2738194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.2738427Z 2025-05-07T20:32:47.2738627Z self = 2025-05-07T20:32:47.2739706Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.2741064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f34e700>} 2025-05-07T20:32:47.2742398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.2743402Z context = 2025-05-07T20:32:47.2743692Z 2025-05-07T20:32:47.2743855Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.2744378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.2744894Z module_map=module_map) 2025-05-07T20:32:47.2745333Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.2745675Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.2745932Z E ^ 2025-05-07T20:32:47.2746378Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.2746835Z 2025-05-07T20:32:47.2747243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4168447Z 2025-05-07T20:32:47.4168816Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4169277Z self=, 2025-05-07T20:32:47.4169742Z T=1, 2025-05-07T20:32:47.4169982Z D=5120, 2025-05-07T20:32:47.4170173Z scale_ub=None, 2025-05-07T20:32:47.4170396Z contiguous=True, 2025-05-07T20:32:47.4170628Z compiled=False, 2025-05-07T20:32:47.4170833Z ) 2025-05-07T20:32:47.4171163Z self = 2025-05-07T20:32:47.4171687Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4171961Z 2025-05-07T20:32:47.4172042Z @given( 2025-05-07T20:32:47.4172271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4172599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4172921Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4173258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4173601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4173909Z ) 2025-05-07T20:32:47.4174265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4174730Z def test_silu_mul_quant( 2025-05-07T20:32:47.4174985Z self, 2025-05-07T20:32:47.4175181Z T: int, 2025-05-07T20:32:47.4175382Z D: int, 2025-05-07T20:32:47.4175606Z scale_ub: Optional[float], 2025-05-07T20:32:47.4175873Z contiguous: bool, 2025-05-07T20:32:47.4176118Z compiled: bool, 2025-05-07T20:32:47.4176345Z ) -> None: 2025-05-07T20:32:47.4176556Z torch.manual_seed(2025) 2025-05-07T20:32:47.4176793Z 2025-05-07T20:32:47.4177066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4177596Z 2025-05-07T20:32:47.4177778Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4178066Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4178375Z x = x_sign * x_clamp 2025-05-07T20:32:47.4178606Z x0 = x[:, :D] 2025-05-07T20:32:47.4178820Z x1 = x[:, D:] 2025-05-07T20:32:47.4179021Z 2025-05-07T20:32:47.4179192Z if contiguous: 2025-05-07T20:32:47.4179424Z x0 = x0.contiguous() 2025-05-07T20:32:47.4179684Z x1 = x1.contiguous() 2025-05-07T20:32:47.4179917Z 2025-05-07T20:32:47.4180107Z if scale_ub is not None: 2025-05-07T20:32:47.4180379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4180715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4181029Z ) 2025-05-07T20:32:47.4181224Z else: 2025-05-07T20:32:47.4181433Z scale_ub_tensor = None 2025-05-07T20:32:47.4181686Z 2025-05-07T20:32:47.4181909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4182238Z op = silu_mul_quant 2025-05-07T20:32:47.4182485Z if compiled: 2025-05-07T20:32:47.4182735Z op = torch.compile(op) 2025-05-07T20:32:47.4183040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4183307Z 2025-05-07T20:32:47.4183506Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4183672Z 2025-05-07T20:32:47.4183775Z moe/activation_test.py:117: 2025-05-07T20:32:47.4184066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4184407Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4184694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4185521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4186216Z 
Hypothesis continued through the remaining examples; each one re-printed the identical test body and failed with the same CompilationError while compiling _fbgemm_silu_mul_quant (CUDAOptions num_stages=3 here, versus num_stages=2 for the reference kernel _kernel_quantize_fp8_row above). Only the example parameters and the failing call path differ:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    moe/activation_test.py:117 -> fn (moe/activation_test.py:115) -> torch/_dynamo/eval_frame.py:678 (_fn) -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]
    E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    moe/activation_test.py:117 -> fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid]
    E   triton.compiler.errors.CompilationError: same ValueError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    same eager path, same CompilationError in _fbgemm_silu_mul_quant
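For reference, the three names in the message denote different 8-bit float layouts: fp8e4nv is E4M3 (4 exponent bits, 3 mantissa bits, NVIDIA variant), fp8e5 is E5M2, and fp8e4b15 is E4M3 with an exponent bias of 15. PyTorch has exposed the first two since 2.1, which makes the range/precision trade-off easy to inspect; a small sketch (the printed values are properties of the formats, not of FBGEMM):

    import torch

    # E4M3: more mantissa precision, max finite value 448.
    # E5M2: wider dynamic range, max finite value 57344.
    for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        fi = torch.finfo(dtype)
        print(f"{dtype}: max={fi.max}, smallest normal={fi.tiny}")

Row-wise quantization kernels like the ones failing here typically target E4M3 because activations are small in magnitude and per-row scaling reclaims range, which is why falling back to fp8e5 is not a drop-in fix.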
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    same eager path, same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    same torch.compile path (via torch/_dynamo/eval_frame.py:678), same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    same torch.compile path, same CompilationError in _fbgemm_silu_mul_quant
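Every one of these tracebacks routes through jit.py:623 (run) into self.compile(...): Triton kernels are compiled lazily at their first bracketed launch, not at import time, which is why the incompatibility only surfaces once hypothesis actually drives an example into the kernel call. A toy, self-contained launch showing the same mechanics (the kernel and all names are illustrative only, not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        # Nothing is compiled when the decorator runs; the body is only
        # lowered to PTX inside the first [grid] launch below.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        tl.store(y_ptr + offsets, tl.load(x_ptr + offsets, mask=mask), mask=mask)

    x = torch.randn(4096, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=1024)  # compilation happens here

Note that compiled=True changes nothing about the failure: the Dynamo wrapper only adds the eval_frame.py frame to the stack, while the Triton kernel underneath is compiled exactly the same way.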
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    this example got past y_fp8, y_scale = fn() and the dequantization y = y_fp8.to(torch.float32) * y_scale[:, None], then failed in the reference path instead:
>       y_fp8_ref, y_scale_ref = ref_fn()
    moe/activation_test.py:126 -> ref_fn (moe/activation_test.py:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row[grid] -> autotuner.py:186 (run) -> autotuner.py:166 (_bench) -> testing.py:117 (do_bench) -> jit.py:623 (run) -> compiler.py:273 (compile)
    E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    same torch.compile path, same CompilationError in _fbgemm_silu_mul_quant
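The reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel, and an autotuned one: the CompilationError fires inside do_bench while the autotuner times candidate configs, before any real launch. On hardware without E4M3 support, a pure-PyTorch stand-in for row-wise FP8 quantization could serve as the reference; a sketch under stated assumptions (the function name and exact clamping rules are ours, not FBGEMM's semantics):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Row-wise dynamic quantization: scale each row so its max |value|
        # maps onto E4M3's largest finite value (448). The final cast is a
        # plain PyTorch elementwise kernel, so it does not depend on
        # Triton's fp8e4nv support.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))  # cap outliers
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale.squeeze(-1)

Dequantization then mirrors the test's own check: y is approximately y_fp8.to(torch.float32) * y_scale[:, None].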
2025-05-07T20:32:47.9656651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.9657330Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.9657858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.9658582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.9659240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.9659761Z kernel = self.compile( 2025-05-07T20:32:47.9660301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.9660952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.9661344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.9661569Z 2025-05-07T20:32:47.9661778Z self = 2025-05-07T20:32:47.9662851Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.9664260Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fe33f60>} 2025-05-07T20:32:47.9665586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.9666592Z context = 2025-05-07T20:32:47.9666885Z 2025-05-07T20:32:47.9667050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.9667565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.9668028Z module_map=module_map) 2025-05-07T20:32:47.9668382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.9668738Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.9668990Z E ^ 2025-05-07T20:32:47.9669444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.9669895Z 2025-05-07T20:32:47.9670305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1054859Z 2025-05-07T20:32:48.1055858Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1056663Z self=, 2025-05-07T20:32:48.1057255Z T=1, 2025-05-07T20:32:48.1057916Z D=5120, 2025-05-07T20:32:48.1058124Z scale_ub=1200.0, 2025-05-07T20:32:48.1058355Z contiguous=False, 2025-05-07T20:32:48.1058589Z compiled=False, 2025-05-07T20:32:48.1058792Z ) 2025-05-07T20:32:48.1059126Z self = 2025-05-07T20:32:48.1059644Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:48.1059924Z 2025-05-07T20:32:48.1060007Z @given( 2025-05-07T20:32:48.1060254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1060578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1060882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1061220Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1061556Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1061838Z ) 2025-05-07T20:32:48.1062190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1062637Z def test_silu_mul_quant( 2025-05-07T20:32:48.1062872Z self, 2025-05-07T20:32:48.1063054Z T: int, 2025-05-07T20:32:48.1063245Z D: int, 2025-05-07T20:32:48.1063460Z scale_ub: Optional[float], 2025-05-07T20:32:48.1063718Z contiguous: bool, 2025-05-07T20:32:48.1063958Z compiled: bool, 2025-05-07T20:32:48.1064325Z ) -> None: 2025-05-07T20:32:48.1064530Z torch.manual_seed(2025) 2025-05-07T20:32:48.1064767Z 2025-05-07T20:32:48.1065039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1065368Z 2025-05-07T20:32:48.1065553Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1065841Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1066140Z x = x_sign * x_clamp 2025-05-07T20:32:48.1066372Z x0 = x[:, :D] 2025-05-07T20:32:48.1066587Z x1 = x[:, D:] 2025-05-07T20:32:48.1066785Z 2025-05-07T20:32:48.1066972Z if contiguous: 2025-05-07T20:32:48.1067209Z x0 = x0.contiguous() 2025-05-07T20:32:48.1067461Z x1 = x1.contiguous() 2025-05-07T20:32:48.1067688Z 2025-05-07T20:32:48.1067882Z if scale_ub is not None: 2025-05-07T20:32:48.1068152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1068593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1068907Z ) 2025-05-07T20:32:48.1069104Z else: 2025-05-07T20:32:48.1069307Z scale_ub_tensor = None 2025-05-07T20:32:48.1069557Z 2025-05-07T20:32:48.1069789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1070093Z op = silu_mul_quant 2025-05-07T20:32:48.1070339Z if compiled: 2025-05-07T20:32:48.1070587Z op = torch.compile(op) 2025-05-07T20:32:48.1070898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1071204Z 2025-05-07T20:32:48.1071392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.1071561Z 2025-05-07T20:32:48.1071671Z moe/activation_test.py:117: 2025-05-07T20:32:48.1071959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1072286Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.1072573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1073268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.1073964Z 
        _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa07f00f2e0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa07e6ba660>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
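Root cause, for triage: fp8e4nv is Triton's FP8 e4m3 encoding for NVIDIA GPUs, and Triton only accepts it on devices with compute capability 8.9 or newer (Ada/Hopper). This g5.4xlarge runner carries an A10G, which reports capability 8.6, so every compile of _fbgemm_silu_mul_quant fails before the kernel ever runs. Below is a minimal sketch of a compute-capability guard such a test could use; `device_supports_fp8e4nv` and `ActivationFP8Tests` are hypothetical names for illustration, not part of the FBGEMM test suite.

# --- Annotation (not from the log): hedged sketch of a capability guard.
# Assumption: fp8e4nv needs SM >= 8.9; the A10G on this runner is SM 8.6.
import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    # Hypothetical helper. get_device_capability() returns (major, minor),
    # e.g. (8, 6) on an A10G, (9, 0) on an H100.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(
    not device_supports_fp8e4nv(),
    "fp8e4nv requires SM 8.9+ (Ada/Hopper); Triton cannot compile it here",
)
class ActivationFP8Tests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # the Hypothesis-driven body from the log would run here unchanged

With such a guard the job would record a clean skip on this hardware instead of one Triton compilation failure per sampled example.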
Hypothesis went on to try the remaining sampled parameter combinations. Every attempt failed at the same point, with the same traceback (moe/activation_test.py:117 -> silu_mul_quant -> triton jit -> compiler.py:273 make_ir) and the identical error; only the tried examples differ:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)

Each of these raised:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9621444Z 2025-05-07T20:32:48.9621853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9622374Z 2025-05-07T20:32:48.9622472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9622875Z self=, 2025-05-07T20:32:48.9623258Z T=2048, 2025-05-07T20:32:48.9623433Z D=7168, 2025-05-07T20:32:48.9623618Z scale_ub=None, 2025-05-07T20:32:48.9623816Z contiguous=False, 2025-05-07T20:32:48.9624033Z compiled=True, 2025-05-07T20:32:48.9624227Z ) 2025-05-07T20:32:48.9624528Z self = 2025-05-07T20:32:48.9625012Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:48.9625276Z 2025-05-07T20:32:48.9625476Z @given( 2025-05-07T20:32:48.9625696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9626002Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9626296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9626625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9626940Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9627213Z ) 2025-05-07T20:32:48.9627552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9627979Z def test_silu_mul_quant( 2025-05-07T20:32:48.9628265Z self, 2025-05-07T20:32:48.9628530Z T: int, 2025-05-07T20:32:48.9628793Z D: int, 2025-05-07T20:32:48.9629103Z scale_ub: Optional[float], 2025-05-07T20:32:48.9629501Z contiguous: bool, 2025-05-07T20:32:48.9629840Z compiled: bool, 2025-05-07T20:32:48.9630214Z ) -> None: 2025-05-07T20:32:48.9630534Z torch.manual_seed(2025) 2025-05-07T20:32:48.9630864Z 2025-05-07T20:32:48.9631213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9631635Z 2025-05-07T20:32:48.9631824Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9632104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9632486Z x = x_sign * x_clamp 2025-05-07T20:32:48.9632723Z x0 = x[:, :D] 2025-05-07T20:32:48.9632929Z x1 = x[:, D:] 2025-05-07T20:32:48.9633128Z 2025-05-07T20:32:48.9633305Z if contiguous: 2025-05-07T20:32:48.9633524Z x0 = x0.contiguous() 2025-05-07T20:32:48.9633777Z x1 = x1.contiguous() 2025-05-07T20:32:48.9634013Z 2025-05-07T20:32:48.9634189Z if scale_ub is not None: 2025-05-07T20:32:48.9634453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9634782Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9635080Z ) 2025-05-07T20:32:48.9635262Z else: 2025-05-07T20:32:48.9635464Z scale_ub_tensor = None 2025-05-07T20:32:48.9635705Z 2025-05-07T20:32:48.9635920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9636227Z op = silu_mul_quant 2025-05-07T20:32:48.9636521Z if compiled: 2025-05-07T20:32:48.9636764Z op = torch.compile(op) 2025-05-07T20:32:48.9637052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9637359Z 2025-05-07T20:32:48.9637541Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.9637707Z 2025-05-07T20:32:48.9637801Z moe/activation_test.py:117: 2025-05-07T20:32:48.9638092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9638426Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.9638693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9639244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.9639794Z return fn(*args, **kwargs) 
2025-05-07T20:32:48.9640435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.9641110Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.9641643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9642316Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9642966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9643491Z kernel = self.compile( 2025-05-07T20:32:48.9644025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9644802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9645285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9645518Z 2025-05-07T20:32:48.9645719Z self = 2025-05-07T20:32:48.9646808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9648178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f931cef20>} 2025-05-07T20:32:48.9649505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9650523Z context = 2025-05-07T20:32:48.9650822Z 2025-05-07T20:32:48.9650986Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9651502Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9651958Z module_map=module_map) 2025-05-07T20:32:48.9652360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9652712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.9652957Z E ^ 2025-05-07T20:32:48.9653415Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.9653870Z 2025-05-07T20:32:48.9654279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.9654783Z 2025-05-07T20:32:48.9654888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.9655294Z self=, 2025-05-07T20:32:48.9655682Z T=4096, 2025-05-07T20:32:48.9655865Z D=7168, 2025-05-07T20:32:48.9656043Z scale_ub=None, 2025-05-07T20:32:48.9656254Z contiguous=False, 2025-05-07T20:32:48.9656473Z compiled=True, 2025-05-07T20:32:49.1916069Z ) 2025-05-07T20:32:49.1916967Z self = 2025-05-07T20:32:49.1917731Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.1918099Z 2025-05-07T20:32:49.1918197Z @given( 2025-05-07T20:32:49.1918484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1918878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1919261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1919583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1919895Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1920168Z ) 2025-05-07T20:32:49.1920523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1920954Z def test_silu_mul_quant( 2025-05-07T20:32:49.1921178Z self, 2025-05-07T20:32:49.1921367Z T: int, 2025-05-07T20:32:49.1921563Z D: int, 2025-05-07T20:32:49.1921776Z scale_ub: Optional[float], 2025-05-07T20:32:49.1922042Z contiguous: bool, 2025-05-07T20:32:49.1922274Z compiled: bool, 2025-05-07T20:32:49.1922489Z ) -> None: 2025-05-07T20:32:49.1922702Z torch.manual_seed(2025) 2025-05-07T20:32:49.1922939Z 2025-05-07T20:32:49.1923197Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1923535Z 2025-05-07T20:32:49.1923718Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1923993Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1924421Z x = x_sign * x_clamp 2025-05-07T20:32:49.1924654Z x0 = x[:, :D] 2025-05-07T20:32:49.1925261Z x1 = x[:, D:] 2025-05-07T20:32:49.1925470Z 2025-05-07T20:32:49.1925648Z if contiguous: 2025-05-07T20:32:49.1925874Z x0 = x0.contiguous() 2025-05-07T20:32:49.1926117Z x1 = x1.contiguous() 2025-05-07T20:32:49.1926353Z 2025-05-07T20:32:49.1926539Z if scale_ub is not None: 2025-05-07T20:32:49.1926796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1927124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1927424Z ) 2025-05-07T20:32:49.1927605Z else: 2025-05-07T20:32:49.1927818Z scale_ub_tensor = None 2025-05-07T20:32:49.1928065Z 2025-05-07T20:32:49.1928282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1928592Z op = silu_mul_quant 2025-05-07T20:32:49.1928838Z if compiled: 2025-05-07T20:32:49.1929077Z op = torch.compile(op) 2025-05-07T20:32:49.1929378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1929663Z 2025-05-07T20:32:49.1929849Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1930026Z 2025-05-07T20:32:49.1930126Z moe/activation_test.py:117: 2025-05-07T20:32:49.1930434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1930776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1931145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1931719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1932277Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1932929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1933618Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1934158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1934848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1935510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1936037Z kernel = self.compile( 2025-05-07T20:32:49.1936674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1937380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1937774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1937998Z 2025-05-07T20:32:49.1938210Z self = 2025-05-07T20:32:49.1939270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1940656Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c00e0>} 2025-05-07T20:32:49.1941990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1943008Z context = 2025-05-07T20:32:49.1943292Z 2025-05-07T20:32:49.1943453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1943973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1944433Z module_map=module_map) 2025-05-07T20:32:49.1944796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1945220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1945482Z E ^ 2025-05-07T20:32:49.1945949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1946395Z 2025-05-07T20:32:49.1946806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1947329Z 2025-05-07T20:32:49.1947438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1947861Z self=, 2025-05-07T20:32:49.1948271Z T=16384, 2025-05-07T20:32:49.1948469Z D=5120, 2025-05-07T20:32:49.1948685Z scale_ub=1200.0, 2025-05-07T20:32:49.1948928Z contiguous=False, 2025-05-07T20:32:49.1949158Z compiled=False, 2025-05-07T20:32:49.1949381Z ) 2025-05-07T20:32:49.1949711Z self = 2025-05-07T20:32:49.1950213Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.1950508Z 2025-05-07T20:32:49.1950591Z @given( 2025-05-07T20:32:49.1950834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1951159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1951467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1951853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1952187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1952470Z ) 2025-05-07T20:32:49.1952822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1953260Z def test_silu_mul_quant( 2025-05-07T20:32:49.1953492Z self, 2025-05-07T20:32:49.1953691Z T: int, 2025-05-07T20:32:49.1953888Z D: int, 2025-05-07T20:32:49.1954099Z scale_ub: Optional[float], 2025-05-07T20:32:49.1954371Z contiguous: bool, 2025-05-07T20:32:49.1954618Z compiled: bool, 2025-05-07T20:32:49.1954836Z ) -> None: 2025-05-07T20:32:49.1955051Z torch.manual_seed(2025) 2025-05-07T20:32:49.1955291Z 2025-05-07T20:32:49.1955560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1955883Z 2025-05-07T20:32:49.1956115Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1956398Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1956688Z x = x_sign * x_clamp 2025-05-07T20:32:49.1956921Z x0 = x[:, :D] 2025-05-07T20:32:49.1957129Z x1 = x[:, D:] 2025-05-07T20:32:49.1957318Z 2025-05-07T20:32:49.1957491Z if contiguous: 2025-05-07T20:32:49.1957714Z x0 = x0.contiguous() 2025-05-07T20:32:49.1957962Z x1 = x1.contiguous() 2025-05-07T20:32:49.1958192Z 2025-05-07T20:32:49.1958375Z if scale_ub is not None: 2025-05-07T20:32:49.1958630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1958961Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1959261Z ) 2025-05-07T20:32:49.1959440Z else: 2025-05-07T20:32:49.1959644Z scale_ub_tensor = None 2025-05-07T20:32:49.1959892Z 2025-05-07T20:32:49.1960105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1960415Z op = silu_mul_quant 2025-05-07T20:32:49.1960662Z if compiled: 2025-05-07T20:32:49.1960902Z op = torch.compile(op) 2025-05-07T20:32:49.1961184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1961452Z 2025-05-07T20:32:49.1961639Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1961797Z 2025-05-07T20:32:49.1961890Z moe/activation_test.py:117: 2025-05-07T20:32:49.1962176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1962494Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1962762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1963543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.1964217Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1964866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1965533Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1966181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1966700Z kernel = self.compile( 2025-05-07T20:32:49.1967224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1967870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1968265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1968488Z 2025-05-07T20:32:49.1968704Z self = 2025-05-07T20:32:49.1969761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1972385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c0b80>} 2025-05-07T20:32:49.1973713Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1974720Z context = 2025-05-07T20:32:49.1975032Z 2025-05-07T20:32:49.1975223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1975734Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1976191Z module_map=module_map) 2025-05-07T20:32:49.1976549Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1976933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1977191Z E ^ 2025-05-07T20:32:49.1977653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1978100Z 2025-05-07T20:32:49.1978531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1979037Z 2025-05-07T20:32:49.1979136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1979552Z self=, 2025-05-07T20:32:49.1979947Z T=16384, 2025-05-07T20:32:49.1980125Z D=5120, 2025-05-07T20:32:49.1980316Z scale_ub=1200.0, 2025-05-07T20:32:49.1980536Z contiguous=True, 2025-05-07T20:32:49.1980761Z compiled=True, 2025-05-07T20:32:49.1980957Z ) 2025-05-07T20:32:49.1981275Z self = 2025-05-07T20:32:49.1981774Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.1982048Z 2025-05-07T20:32:49.1982124Z @given( 2025-05-07T20:32:49.1982364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1982679Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1982981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1983313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1983639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1983927Z ) 2025-05-07T20:32:49.1984265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1984794Z def test_silu_mul_quant( 2025-05-07T20:32:49.1985039Z self, 2025-05-07T20:32:49.1985231Z T: int, 2025-05-07T20:32:49.1985428Z D: int, 2025-05-07T20:32:49.1985648Z scale_ub: Optional[float], 2025-05-07T20:32:49.1985908Z contiguous: bool, 2025-05-07T20:32:49.1986158Z compiled: bool, 2025-05-07T20:32:49.1986383Z ) -> None: 2025-05-07T20:32:49.1986597Z torch.manual_seed(2025) 2025-05-07T20:32:49.1986838Z 2025-05-07T20:32:49.1987106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1987462Z 2025-05-07T20:32:49.1987657Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1997362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1997730Z x = x_sign * x_clamp 2025-05-07T20:32:49.1997979Z x0 = x[:, :D] 2025-05-07T20:32:49.1998213Z x1 = x[:, D:] 2025-05-07T20:32:49.1998435Z 2025-05-07T20:32:49.1998616Z if contiguous: 2025-05-07T20:32:49.1998859Z x0 = x0.contiguous() 2025-05-07T20:32:49.1999123Z x1 = x1.contiguous() 2025-05-07T20:32:49.1999359Z 2025-05-07T20:32:49.1999560Z if scale_ub is not None: 2025-05-07T20:32:49.1999845Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.2000191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.2000584Z ) 2025-05-07T20:32:49.2000794Z else: 2025-05-07T20:32:49.2001006Z scale_ub_tensor = None 2025-05-07T20:32:49.2001270Z 2025-05-07T20:32:49.2001515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.2001832Z op = silu_mul_quant 2025-05-07T20:32:49.2002099Z if compiled: 2025-05-07T20:32:49.2002363Z op = torch.compile(op) 2025-05-07T20:32:49.2002670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.2002945Z 2025-05-07T20:32:49.2003147Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.2003318Z 2025-05-07T20:32:49.2003436Z moe/activation_test.py:117: 2025-05-07T20:32:49.2003734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.2004077Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.2004467Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.2005073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.2005636Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.2006294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.2007025Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.2007553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.2008517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.2009189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.2009717Z kernel = self.compile( 2025-05-07T20:32:49.2010254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.2010917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.2011324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.2011552Z 2025-05-07T20:32:49.2011759Z self = 2025-05-07T20:32:49.2012835Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.2014374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c22a0>} 2025-05-07T20:32:49.2015758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.2016784Z context = 2025-05-07T20:32:49.2017069Z 2025-05-07T20:32:49.2017232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.2017753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.2018227Z module_map=module_map) 2025-05-07T20:32:49.2018589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.2018942Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.2019205Z E ^ 2025-05-07T20:32:49.2019682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.2020129Z 2025-05-07T20:32:49.2020544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3605453Z 2025-05-07T20:32:49.3605797Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3607091Z self=, 2025-05-07T20:32:49.3607975Z T=16384, 2025-05-07T20:32:49.3608184Z D=5120, 2025-05-07T20:32:49.3608642Z scale_ub=None, 2025-05-07T20:32:49.3608883Z contiguous=False, 2025-05-07T20:32:49.3609126Z compiled=True, 2025-05-07T20:32:49.3609361Z ) 2025-05-07T20:32:49.3609712Z self = 2025-05-07T20:32:49.3610253Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.3610572Z 2025-05-07T20:32:49.3610658Z @given( 2025-05-07T20:32:49.3610913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3611264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3611597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3611951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3612391Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3612675Z ) 2025-05-07T20:32:49.3613013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3613454Z def test_silu_mul_quant( 2025-05-07T20:32:49.3613689Z self, 2025-05-07T20:32:49.3613872Z T: int, 2025-05-07T20:32:49.3614064Z D: int, 2025-05-07T20:32:49.3614278Z scale_ub: Optional[float], 2025-05-07T20:32:49.3614538Z contiguous: bool, 2025-05-07T20:32:49.3614811Z compiled: bool, 2025-05-07T20:32:49.3615027Z ) -> None: 2025-05-07T20:32:49.3615240Z torch.manual_seed(2025) 2025-05-07T20:32:49.3615476Z 2025-05-07T20:32:49.3615744Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3616086Z 2025-05-07T20:32:49.3616275Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3616605Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3616915Z x = x_sign * x_clamp 2025-05-07T20:32:49.3617154Z x0 = x[:, :D] 2025-05-07T20:32:49.3617364Z x1 = x[:, D:] 2025-05-07T20:32:49.3617560Z 2025-05-07T20:32:49.3617741Z if contiguous: 2025-05-07T20:32:49.3617966Z x0 = x0.contiguous() 2025-05-07T20:32:49.3618217Z x1 = x1.contiguous() 2025-05-07T20:32:49.3618453Z 2025-05-07T20:32:49.3618635Z if scale_ub is not None: 2025-05-07T20:32:49.3618894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3619227Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3619525Z ) 2025-05-07T20:32:49.3619717Z else: 2025-05-07T20:32:49.3620079Z scale_ub_tensor = None 2025-05-07T20:32:49.3620337Z 2025-05-07T20:32:49.3620567Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3620884Z op = silu_mul_quant 2025-05-07T20:32:49.3621152Z if compiled: 2025-05-07T20:32:49.3621486Z op = torch.compile(op) 2025-05-07T20:32:49.3621896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3622263Z 2025-05-07T20:32:49.3622514Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3622725Z 2025-05-07T20:32:49.3622819Z moe/activation_test.py:117: 2025-05-07T20:32:49.3623107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3623431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3623697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3624250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3624804Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3625449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3626122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3626646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3627403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3628047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3628566Z kernel = self.compile( 2025-05-07T20:32:49.3629097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3629741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3630124Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3630353Z 2025-05-07T20:32:49.3630554Z self = 2025-05-07T20:32:49.3631620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3633031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c3060>} 2025-05-07T20:32:49.3634352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3635359Z context = 2025-05-07T20:32:49.3635648Z 2025-05-07T20:32:49.3635811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3636325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3636778Z module_map=module_map) 2025-05-07T20:32:49.3637138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3637484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3637725Z E ^ 2025-05-07T20:32:49.3638183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3638630Z 2025-05-07T20:32:49.3639043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3639545Z 2025-05-07T20:32:49.3639651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3640051Z self=, 2025-05-07T20:32:49.3640522Z T=2048, 2025-05-07T20:32:49.3640703Z D=5120, 2025-05-07T20:32:49.3640878Z scale_ub=None, 2025-05-07T20:32:49.3641089Z contiguous=False, 2025-05-07T20:32:49.3641309Z compiled=True, 2025-05-07T20:32:49.3641506Z ) 2025-05-07T20:32:49.3641813Z self = 2025-05-07T20:32:49.3642306Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.3642569Z 2025-05-07T20:32:49.3642649Z @given( 2025-05-07T20:32:49.3642866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3643171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3643472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3643788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3644112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3644505Z ) 2025-05-07T20:32:49.3644853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3645280Z def test_silu_mul_quant( 2025-05-07T20:32:49.3645517Z self, 2025-05-07T20:32:49.3645703Z T: int, 2025-05-07T20:32:49.3645884Z D: int, 2025-05-07T20:32:49.3646096Z scale_ub: Optional[float], 2025-05-07T20:32:49.3646363Z contiguous: bool, 2025-05-07T20:32:49.3646643Z compiled: bool, 2025-05-07T20:32:49.3646862Z ) -> None: 2025-05-07T20:32:49.3647067Z torch.manual_seed(2025) 2025-05-07T20:32:49.3647291Z 2025-05-07T20:32:49.3647557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3647889Z 2025-05-07T20:32:49.3648065Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3648352Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3648651Z x = x_sign * x_clamp 2025-05-07T20:32:49.3648871Z x0 = x[:, :D] 2025-05-07T20:32:49.3649079Z x1 = x[:, D:] 2025-05-07T20:32:49.3649278Z 2025-05-07T20:32:49.3649453Z if contiguous: 2025-05-07T20:32:49.3649677Z x0 = x0.contiguous() 2025-05-07T20:32:49.3649926Z x1 = x1.contiguous() 2025-05-07T20:32:49.3650159Z 2025-05-07T20:32:49.3650333Z if scale_ub is not None: 2025-05-07T20:32:49.3650653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3650985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3651273Z ) 2025-05-07T20:32:49.3651472Z else: 2025-05-07T20:32:49.3651709Z scale_ub_tensor = None 2025-05-07T20:32:49.3651944Z 2025-05-07T20:32:49.3652170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3652476Z op = silu_mul_quant 2025-05-07T20:32:49.3652716Z if compiled: 2025-05-07T20:32:49.3652955Z op = torch.compile(op) 2025-05-07T20:32:49.3653239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3653494Z 2025-05-07T20:32:49.3653678Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3653835Z 2025-05-07T20:32:49.3653934Z moe/activation_test.py:117: 2025-05-07T20:32:49.3654226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3654542Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3654825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3655376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3655916Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3656564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3657242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3657772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3658595Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3659250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3659771Z kernel = self.compile( 2025-05-07T20:32:49.3660296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3660945Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3661336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3661559Z 2025-05-07T20:32:49.3661768Z self = 2025-05-07T20:32:49.3662831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3664191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd07c0>} 2025-05-07T20:32:49.3665520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3666573Z context = 2025-05-07T20:32:49.3666853Z 2025-05-07T20:32:49.3667019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3667521Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3667977Z module_map=module_map) 2025-05-07T20:32:49.3668329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3668663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3668911Z E ^ 2025-05-07T20:32:49.3669369Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3669810Z 2025-05-07T20:32:49.3670224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.8083633Z 2025-05-07T20:32:49.8084129Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.8085507Z self=, 2025-05-07T20:32:49.8086311Z T=2048, 2025-05-07T20:32:49.8086633Z D=5120, 2025-05-07T20:32:49.8086924Z scale_ub=1200.0, 2025-05-07T20:32:49.8087156Z contiguous=False, 2025-05-07T20:32:49.8087385Z compiled=True, 2025-05-07T20:32:49.8087584Z ) 2025-05-07T20:32:49.8087916Z self = 2025-05-07T20:32:49.8088422Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.8088746Z 2025-05-07T20:32:49.8088826Z @given( 2025-05-07T20:32:49.8089052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.8089354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.8089658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.8089987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.8090306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.8090589Z ) 2025-05-07T20:32:49.8090935Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.8091380Z def test_silu_mul_quant( 2025-05-07T20:32:49.8091622Z self, 2025-05-07T20:32:49.8091811Z T: int, 2025-05-07T20:32:49.8091996Z D: int, 2025-05-07T20:32:49.8092210Z scale_ub: Optional[float], 2025-05-07T20:32:49.8092476Z contiguous: bool, 2025-05-07T20:32:49.8092710Z compiled: bool, 2025-05-07T20:32:49.8092926Z ) -> None: 2025-05-07T20:32:49.8093310Z torch.manual_seed(2025) 2025-05-07T20:32:49.8093558Z 2025-05-07T20:32:49.8093822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.8094159Z 2025-05-07T20:32:49.8094347Z x_sign = torch.sign(x) 2025-05-07T20:32:49.8094637Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.8094947Z x = x_sign * x_clamp 2025-05-07T20:32:49.8095183Z x0 = x[:, :D] 2025-05-07T20:32:49.8095385Z x1 = x[:, D:] 2025-05-07T20:32:49.8095590Z 2025-05-07T20:32:49.8095772Z if contiguous: 2025-05-07T20:32:49.8095993Z x0 = x0.contiguous() 2025-05-07T20:32:49.8096250Z x1 = x1.contiguous() 2025-05-07T20:32:49.8096484Z 2025-05-07T20:32:49.8096663Z if scale_ub is not None: 2025-05-07T20:32:49.8096934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.8097270Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.8097567Z ) 2025-05-07T20:32:49.8097760Z else: 2025-05-07T20:32:49.8097970Z scale_ub_tensor = None 2025-05-07T20:32:49.8098218Z 2025-05-07T20:32:49.8098437Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.8098747Z op = silu_mul_quant 2025-05-07T20:32:49.8098995Z if compiled: 2025-05-07T20:32:49.8099357Z op = torch.compile(op) 2025-05-07T20:32:49.8099647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8099917Z 2025-05-07T20:32:49.8100098Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.8100265Z 2025-05-07T20:32:49.8100361Z moe/activation_test.py:117: 2025-05-07T20:32:49.8100654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8100977Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.8101258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8101826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.8102385Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.8103030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.8103710Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.8104313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.8104978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.8105631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.8106155Z kernel = self.compile( 2025-05-07T20:32:49.8106691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.8107334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.8107732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8107957Z 2025-05-07T20:32:49.8108167Z self = 2025-05-07T20:32:49.8109534Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.8110900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd1580>} 2025-05-07T20:32:49.8112238Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.8113371Z context = 2025-05-07T20:32:49.8113658Z 2025-05-07T20:32:49.8113827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.8114340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.8114807Z module_map=module_map) 2025-05-07T20:32:49.8115172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8115521Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.8115769Z E ^ 2025-05-07T20:32:49.8116226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8116673Z 2025-05-07T20:32:49.8117089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.8117595Z 2025-05-07T20:32:49.8117703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.8118107Z self=, 2025-05-07T20:32:49.8118500Z T=4096, 2025-05-07T20:32:49.8118682Z D=5120, 2025-05-07T20:32:49.8118861Z scale_ub=1200.0, 2025-05-07T20:32:49.8119073Z contiguous=True, 2025-05-07T20:32:49.8119288Z compiled=True, 2025-05-07T20:32:49.8119478Z ) 2025-05-07T20:32:49.8119849Z self = 2025-05-07T20:32:49.8120333Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.8120596Z 2025-05-07T20:32:49.8120666Z @given( 2025-05-07T20:32:49.8120884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.8122668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.8122966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.8123280Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.8123597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.8123870Z ) 2025-05-07T20:32:49.8124209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.8124730Z def test_silu_mul_quant( 2025-05-07T20:32:49.8124961Z self, 2025-05-07T20:32:49.8125139Z T: int, 2025-05-07T20:32:49.8125419Z D: int, 2025-05-07T20:32:49.8125658Z scale_ub: Optional[float], 2025-05-07T20:32:49.8125916Z contiguous: bool, 2025-05-07T20:32:49.8126146Z compiled: bool, 2025-05-07T20:32:49.8126357Z ) -> None: 2025-05-07T20:32:49.8126558Z torch.manual_seed(2025) 2025-05-07T20:32:49.8126789Z 2025-05-07T20:32:49.8127055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.8127391Z 2025-05-07T20:32:49.8127566Z x_sign = torch.sign(x) 2025-05-07T20:32:49.8127851Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.8128154Z x = x_sign * x_clamp 2025-05-07T20:32:49.8128379Z x0 = x[:, :D] 2025-05-07T20:32:49.8128596Z x1 = x[:, D:] 2025-05-07T20:32:49.8128799Z 2025-05-07T20:32:49.8128972Z if contiguous: 2025-05-07T20:32:49.8129201Z x0 = x0.contiguous() 2025-05-07T20:32:49.8129459Z x1 = x1.contiguous() 2025-05-07T20:32:49.8129684Z 2025-05-07T20:32:49.8129876Z if scale_ub is not None: 2025-05-07T20:32:49.8130146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.8130470Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.8130771Z ) 2025-05-07T20:32:49.8130955Z else: 2025-05-07T20:32:49.8131152Z scale_ub_tensor = None 2025-05-07T20:32:49.8131400Z 2025-05-07T20:32:49.8131632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.8131938Z op = silu_mul_quant 2025-05-07T20:32:49.8132178Z if compiled: 2025-05-07T20:32:49.8132424Z op = torch.compile(op) 2025-05-07T20:32:49.8132715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8133058Z 2025-05-07T20:32:49.8133241Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.8133402Z 2025-05-07T20:32:49.8133504Z moe/activation_test.py:117: 2025-05-07T20:32:49.8133789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8134120Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.8134399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.8134940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.8135554Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.8136203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.8136875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.8137405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.8138085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.8138741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.8139257Z kernel = self.compile( 2025-05-07T20:32:49.8147716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.8148483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.8148892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.8149140Z 2025-05-07T20:32:49.8149354Z self = 2025-05-07T20:32:49.8150472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.8151854Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd2840>} 2025-05-07T20:32:49.8153196Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.8154282Z context = 2025-05-07T20:32:49.8154584Z 2025-05-07T20:32:49.8154755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.8155287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.8155757Z module_map=module_map) 2025-05-07T20:32:49.8156132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8156497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.8156757Z E ^ 2025-05-07T20:32:49.8157235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8157700Z 2025-05-07T20:32:49.8158119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9851561Z 2025-05-07T20:32:49.9852055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9852797Z self=, 2025-05-07T20:32:49.9853388Z T=128, 2025-05-07T20:32:49.9853644Z D=5120, 2025-05-07T20:32:49.9853911Z scale_ub=1200.0, 2025-05-07T20:32:49.9854200Z contiguous=False, 2025-05-07T20:32:49.9854495Z compiled=True, 2025-05-07T20:32:49.9854758Z ) 2025-05-07T20:32:49.9855070Z self = 2025-05-07T20:32:49.9855823Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.9856106Z 2025-05-07T20:32:49.9856182Z @given( 2025-05-07T20:32:49.9856421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9856725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9857039Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9857382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9857698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9857981Z ) 2025-05-07T20:32:49.9858324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9858750Z def test_silu_mul_quant( 2025-05-07T20:32:49.9858988Z self, 2025-05-07T20:32:49.9859182Z T: int, 2025-05-07T20:32:49.9859368Z D: int, 2025-05-07T20:32:49.9859579Z scale_ub: Optional[float], 2025-05-07T20:32:49.9859843Z contiguous: bool, 2025-05-07T20:32:49.9860078Z compiled: bool, 2025-05-07T20:32:49.9860293Z ) -> None: 2025-05-07T20:32:49.9860504Z torch.manual_seed(2025) 2025-05-07T20:32:49.9860742Z 2025-05-07T20:32:49.9861006Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9861349Z 2025-05-07T20:32:49.9861551Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9861906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9862210Z x = x_sign * x_clamp 2025-05-07T20:32:49.9862439Z x0 = x[:, :D] 2025-05-07T20:32:49.9862634Z x1 = x[:, D:] 2025-05-07T20:32:49.9862829Z 2025-05-07T20:32:49.9863037Z if contiguous: 2025-05-07T20:32:49.9863257Z x0 = x0.contiguous() 2025-05-07T20:32:49.9863513Z x1 = x1.contiguous() 2025-05-07T20:32:49.9863740Z 2025-05-07T20:32:49.9863931Z if scale_ub is not None: 2025-05-07T20:32:49.9864195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9864524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9864832Z ) 2025-05-07T20:32:49.9865022Z else: 2025-05-07T20:32:49.9865219Z scale_ub_tensor = None 2025-05-07T20:32:49.9865460Z 2025-05-07T20:32:49.9865685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9866072Z op = silu_mul_quant 2025-05-07T20:32:49.9866313Z if compiled: 2025-05-07T20:32:49.9866558Z op = torch.compile(op) 2025-05-07T20:32:49.9866857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9867123Z 2025-05-07T20:32:49.9867309Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9867477Z 2025-05-07T20:32:49.9867578Z moe/activation_test.py:117: 2025-05-07T20:32:49.9867875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9868203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9868481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9869038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9869589Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9870255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9870954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9871491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9872165Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9872822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9873341Z kernel = self.compile( 2025-05-07T20:32:49.9873882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9874624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9875016Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9875242Z 2025-05-07T20:32:49.9875457Z self = 2025-05-07T20:32:49.9876531Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9877970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd34c0>} 2025-05-07T20:32:49.9879315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9880338Z context = 2025-05-07T20:32:49.9880628Z 2025-05-07T20:32:49.9880801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9881311Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9881824Z module_map=module_map) 2025-05-07T20:32:49.9882182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9882521Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9882774Z E ^ 2025-05-07T20:32:49.9883233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9883677Z 2025-05-07T20:32:49.9884097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.9884752Z 2025-05-07T20:32:49.9884856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.9885271Z self=, 2025-05-07T20:32:49.9885705Z T=16384, 2025-05-07T20:32:49.9885891Z D=7168, 2025-05-07T20:32:49.9886071Z scale_ub=1200.0, 2025-05-07T20:32:49.9886342Z contiguous=True, 2025-05-07T20:32:49.9886561Z compiled=True, 2025-05-07T20:32:49.9886755Z ) 2025-05-07T20:32:49.9887069Z self = 2025-05-07T20:32:49.9887566Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.9887836Z 2025-05-07T20:32:49.9887908Z @given( 2025-05-07T20:32:49.9888132Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.9888437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.9888728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.9889047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.9889374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.9889650Z ) 2025-05-07T20:32:49.9889975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.9890404Z def test_silu_mul_quant( 2025-05-07T20:32:49.9890635Z self, 2025-05-07T20:32:49.9890820Z T: int, 2025-05-07T20:32:49.9891013Z D: int, 2025-05-07T20:32:49.9891223Z scale_ub: Optional[float], 2025-05-07T20:32:49.9891477Z contiguous: bool, 2025-05-07T20:32:49.9891704Z compiled: bool, 2025-05-07T20:32:49.9891916Z ) -> None: 2025-05-07T20:32:49.9892114Z torch.manual_seed(2025) 2025-05-07T20:32:49.9892346Z 2025-05-07T20:32:49.9892609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.9892933Z 2025-05-07T20:32:49.9893112Z x_sign = torch.sign(x) 2025-05-07T20:32:49.9893395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.9893692Z x = x_sign * x_clamp 2025-05-07T20:32:49.9894003Z x0 = x[:, :D] 2025-05-07T20:32:49.9894208Z x1 = x[:, D:] 2025-05-07T20:32:49.9894399Z 2025-05-07T20:32:49.9894571Z if contiguous: 2025-05-07T20:32:49.9894792Z x0 = x0.contiguous() 2025-05-07T20:32:49.9895037Z x1 = x1.contiguous() 2025-05-07T20:32:49.9895265Z 2025-05-07T20:32:49.9895445Z if scale_ub is not None: 2025-05-07T20:32:49.9895711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.9896033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.9896324Z ) 2025-05-07T20:32:49.9896505Z else: 2025-05-07T20:32:49.9896702Z scale_ub_tensor = None 2025-05-07T20:32:49.9896942Z 2025-05-07T20:32:49.9897162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.9897462Z op = silu_mul_quant 2025-05-07T20:32:49.9897698Z if compiled: 2025-05-07T20:32:49.9897939Z op = torch.compile(op) 2025-05-07T20:32:49.9898224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9898492Z 2025-05-07T20:32:49.9898672Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.9898831Z 2025-05-07T20:32:49.9898930Z moe/activation_test.py:117: 2025-05-07T20:32:49.9899215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9899593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.9899870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.9900410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.9900964Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.9901625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.9902305Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.9902845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.9903524Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.9904186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.9904746Z kernel = self.compile( 2025-05-07T20:32:49.9905281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.9905928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.9906319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.9906545Z 2025-05-07T20:32:49.9906745Z self = 2025-05-07T20:32:49.9907879Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.9909436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc4c20>} 2025-05-07T20:32:49.9910774Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.9911785Z context = 2025-05-07T20:32:49.9912080Z 2025-05-07T20:32:49.9912243Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.9912758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.9913225Z module_map=module_map) 2025-05-07T20:32:49.9913709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.9914061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.9914316Z E ^ 2025-05-07T20:32:49.9914765Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.9915217Z 2025-05-07T20:32:49.9915678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1075679Z 2025-05-07T20:32:50.1075992Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1076414Z self=, 2025-05-07T20:32:50.1076854Z T=16384, 2025-05-07T20:32:50.1077064Z D=5120, 2025-05-07T20:32:50.1077375Z scale_ub=1200.0, 2025-05-07T20:32:50.1077681Z contiguous=True, 2025-05-07T20:32:50.1078038Z compiled=False, 2025-05-07T20:32:50.1078344Z ) 2025-05-07T20:32:50.1078854Z self = 2025-05-07T20:32:50.1079374Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.1079647Z 2025-05-07T20:32:50.1079731Z @given( 2025-05-07T20:32:50.1079964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1080269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1080703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1081029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1081345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1081631Z ) 2025-05-07T20:32:50.1081977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1082404Z def test_silu_mul_quant( 2025-05-07T20:32:50.1082639Z self, 2025-05-07T20:32:50.1082834Z T: int, 2025-05-07T20:32:50.1083016Z D: int, 2025-05-07T20:32:50.1083226Z scale_ub: Optional[float], 2025-05-07T20:32:50.1083490Z contiguous: bool, 2025-05-07T20:32:50.1083720Z compiled: bool, 2025-05-07T20:32:50.1083946Z ) -> None: 2025-05-07T20:32:50.1084165Z torch.manual_seed(2025) 2025-05-07T20:32:50.1084515Z 2025-05-07T20:32:50.1084789Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1085221Z 2025-05-07T20:32:50.1085430Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1085736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1086048Z x = x_sign * x_clamp 2025-05-07T20:32:50.1086290Z x0 = x[:, :D] 2025-05-07T20:32:50.1086498Z x1 = x[:, D:] 2025-05-07T20:32:50.1086696Z 2025-05-07T20:32:50.1086867Z if contiguous: 2025-05-07T20:32:50.1087082Z x0 = x0.contiguous() 2025-05-07T20:32:50.1087341Z x1 = x1.contiguous() 2025-05-07T20:32:50.1087570Z 2025-05-07T20:32:50.1087755Z if scale_ub is not None: 2025-05-07T20:32:50.1088031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1088362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1088653Z ) 2025-05-07T20:32:50.1088839Z else: 2025-05-07T20:32:50.1089041Z scale_ub_tensor = None 2025-05-07T20:32:50.1089276Z 2025-05-07T20:32:50.1089510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1089827Z op = silu_mul_quant 2025-05-07T20:32:50.1090076Z if compiled: 2025-05-07T20:32:50.1090312Z op = torch.compile(op) 2025-05-07T20:32:50.1090606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1090867Z 2025-05-07T20:32:50.1091048Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1091207Z 2025-05-07T20:32:50.1091302Z moe/activation_test.py:117: 2025-05-07T20:32:50.1091597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1091921Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1092364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1093058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.1093740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1094270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1094960Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1095618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1096128Z kernel = self.compile( 2025-05-07T20:32:50.1096654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1097300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1097699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1097925Z 2025-05-07T20:32:50.1098126Z self = 2025-05-07T20:32:50.1099195Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1100610Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc5580>} 2025-05-07T20:32:50.1101940Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1102947Z context = 2025-05-07T20:32:50.1103227Z 2025-05-07T20:32:50.1103394Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1103905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1104358Z module_map=module_map) 2025-05-07T20:32:50.1104750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1105094Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1105342Z E ^ 2025-05-07T20:32:50.1105787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1106234Z 2025-05-07T20:32:50.1106641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1107152Z 2025-05-07T20:32:50.1107248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1107680Z self=, 2025-05-07T20:32:50.1108119Z T=1, 2025-05-07T20:32:50.1108532Z D=7168, 2025-05-07T20:32:50.1108734Z scale_ub=1200.0, 2025-05-07T20:32:50.1108948Z contiguous=False, 2025-05-07T20:32:50.1109159Z compiled=False, 2025-05-07T20:32:50.1109354Z ) 2025-05-07T20:32:50.1109665Z self = 2025-05-07T20:32:50.1110148Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.1110412Z 2025-05-07T20:32:50.1110481Z @given( 2025-05-07T20:32:50.1110703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1111001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1111285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1111603Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1111915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1112180Z ) 2025-05-07T20:32:50.1112656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1113088Z def test_silu_mul_quant( 2025-05-07T20:32:50.1113316Z self, 2025-05-07T20:32:50.1113496Z T: int, 2025-05-07T20:32:50.1113684Z D: int, 2025-05-07T20:32:50.1113893Z scale_ub: Optional[float], 2025-05-07T20:32:50.1114156Z contiguous: bool, 2025-05-07T20:32:50.1114382Z compiled: bool, 2025-05-07T20:32:50.1114595Z ) -> None: 2025-05-07T20:32:50.1114792Z torch.manual_seed(2025) 2025-05-07T20:32:50.1115019Z 2025-05-07T20:32:50.1115278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1115599Z 2025-05-07T20:32:50.1115777Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1116056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1116346Z x = x_sign * x_clamp 2025-05-07T20:32:50.1116573Z x0 = x[:, :D] 2025-05-07T20:32:50.1116776Z x1 = x[:, D:] 2025-05-07T20:32:50.1116969Z 2025-05-07T20:32:50.1117135Z if contiguous: 2025-05-07T20:32:50.1117352Z x0 = x0.contiguous() 2025-05-07T20:32:50.1117595Z x1 = x1.contiguous() 2025-05-07T20:32:50.1117822Z 2025-05-07T20:32:50.1118001Z if scale_ub is not None: 2025-05-07T20:32:50.1118261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1118651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1118940Z ) 2025-05-07T20:32:50.1119123Z else: 2025-05-07T20:32:50.1119311Z scale_ub_tensor = None 2025-05-07T20:32:50.1119547Z 2025-05-07T20:32:50.1119764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1120060Z op = silu_mul_quant 2025-05-07T20:32:50.1120300Z if compiled: 2025-05-07T20:32:50.1120534Z op = torch.compile(op) 2025-05-07T20:32:50.1120811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1121070Z 2025-05-07T20:32:50.1121258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1121414Z 2025-05-07T20:32:50.1121508Z moe/activation_test.py:117: 2025-05-07T20:32:50.1121792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1122113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1122450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1123123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1123795Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1124414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1125079Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1125724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1126245Z kernel = self.compile( 2025-05-07T20:32:50.1126775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1127407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1127795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1128020Z 2025-05-07T20:32:50.1128224Z self = 2025-05-07T20:32:50.1129293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1130641Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc68e0>} 2025-05-07T20:32:50.1132050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1133053Z context = 2025-05-07T20:32:50.1133343Z 2025-05-07T20:32:50.1133509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1134009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1134461Z module_map=module_map) 2025-05-07T20:32:50.1134816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1135155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1135396Z E ^ 2025-05-07T20:32:50.1135842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1136283Z 2025-05-07T20:32:50.1136700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1137199Z 2025-05-07T20:32:50.1137299Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1137690Z self=, 2025-05-07T20:32:50.1138121Z T=4096, 2025-05-07T20:32:50.1138293Z D=7168, 2025-05-07T20:32:50.1138467Z scale_ub=1200.0, 2025-05-07T20:32:50.1138675Z contiguous=False, 2025-05-07T20:32:50.1138887Z compiled=True, 2025-05-07T20:32:50.2775322Z ) 2025-05-07T20:32:50.2776805Z self = 2025-05-07T20:32:50.2778363Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.2779120Z 2025-05-07T20:32:50.2779324Z @given( 2025-05-07T20:32:50.2779892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2780520Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2781103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2781735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2782359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2782894Z ) 2025-05-07T20:32:50.2783791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2784648Z def test_silu_mul_quant( 2025-05-07T20:32:50.2785104Z self, 2025-05-07T20:32:50.2785366Z T: int, 2025-05-07T20:32:50.2792129Z D: int, 2025-05-07T20:32:50.2792365Z scale_ub: Optional[float], 2025-05-07T20:32:50.2792657Z contiguous: bool, 2025-05-07T20:32:50.2792914Z compiled: bool, 2025-05-07T20:32:50.2793146Z ) -> None: 2025-05-07T20:32:50.2793374Z torch.manual_seed(2025) 2025-05-07T20:32:50.2793627Z 2025-05-07T20:32:50.2793908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2794282Z 2025-05-07T20:32:50.2794484Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2794792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2795110Z x = x_sign * x_clamp 2025-05-07T20:32:50.2795376Z x0 = x[:, :D] 2025-05-07T20:32:50.2795639Z x1 = x[:, D:] 2025-05-07T20:32:50.2795854Z 2025-05-07T20:32:50.2796052Z if contiguous: 2025-05-07T20:32:50.2796295Z x0 = x0.contiguous() 2025-05-07T20:32:50.2796560Z x1 = x1.contiguous() 2025-05-07T20:32:50.2796798Z 2025-05-07T20:32:50.2796995Z if scale_ub is not None: 2025-05-07T20:32:50.2797264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2797610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2797921Z ) 2025-05-07T20:32:50.2798109Z else: 2025-05-07T20:32:50.2798329Z scale_ub_tensor = None 2025-05-07T20:32:50.2798583Z 2025-05-07T20:32:50.2798970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2799293Z op = silu_mul_quant 2025-05-07T20:32:50.2799548Z if compiled: 2025-05-07T20:32:50.2799797Z op = torch.compile(op) 2025-05-07T20:32:50.2800096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2800381Z 2025-05-07T20:32:50.2800585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2800754Z 2025-05-07T20:32:50.2800854Z moe/activation_test.py:117: 2025-05-07T20:32:50.2801162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2801502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2801781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2802345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2802913Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2803583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2804395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2804942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2805638Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2806391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2806927Z kernel = self.compile( 2025-05-07T20:32:50.2807478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2808150Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2808734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2808980Z 2025-05-07T20:32:50.2809214Z self = 2025-05-07T20:32:50.2810325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2811794Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92fc7a60>} 2025-05-07T20:32:50.2813149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2814186Z context = 2025-05-07T20:32:50.2814482Z 2025-05-07T20:32:50.2814654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2815190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2815706Z module_map=module_map) 2025-05-07T20:32:50.2816075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2816440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2816717Z E ^ 2025-05-07T20:32:50.2817184Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2817634Z 2025-05-07T20:32:50.2818049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2818558Z 2025-05-07T20:32:50.2818680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2819095Z self=, 2025-05-07T20:32:50.2819497Z T=128, 2025-05-07T20:32:50.2819687Z D=7168, 2025-05-07T20:32:50.2819884Z scale_ub=1200.0, 2025-05-07T20:32:50.2820229Z contiguous=False, 2025-05-07T20:32:50.2820463Z compiled=True, 2025-05-07T20:32:50.2820672Z ) 2025-05-07T20:32:50.2821001Z self = 2025-05-07T20:32:50.2821487Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.2821767Z 2025-05-07T20:32:50.2821846Z @given( 2025-05-07T20:32:50.2822080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.2822392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.2822706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.2823039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.2823360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.2823660Z ) 2025-05-07T20:32:50.2824012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.2824455Z def test_silu_mul_quant( 2025-05-07T20:32:50.2824705Z self, 2025-05-07T20:32:50.2824905Z T: int, 2025-05-07T20:32:50.2825110Z D: int, 2025-05-07T20:32:50.2825349Z scale_ub: Optional[float], 2025-05-07T20:32:50.2825647Z contiguous: bool, 2025-05-07T20:32:50.2825891Z compiled: bool, 2025-05-07T20:32:50.2826117Z ) -> None: 2025-05-07T20:32:50.2826402Z torch.manual_seed(2025) 2025-05-07T20:32:50.2826646Z 2025-05-07T20:32:50.2826922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.2827275Z 2025-05-07T20:32:50.2827475Z x_sign = torch.sign(x) 2025-05-07T20:32:50.2827767Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.2828082Z x = x_sign * x_clamp 2025-05-07T20:32:50.2828326Z x0 = x[:, :D] 2025-05-07T20:32:50.2828540Z x1 = x[:, D:] 2025-05-07T20:32:50.2828748Z 2025-05-07T20:32:50.2828936Z if contiguous: 2025-05-07T20:32:50.2829165Z x0 = x0.contiguous() 2025-05-07T20:32:50.2829437Z x1 = x1.contiguous() 2025-05-07T20:32:50.2829679Z 2025-05-07T20:32:50.2829880Z if scale_ub is not None: 2025-05-07T20:32:50.2830153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.2830490Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.2830854Z ) 2025-05-07T20:32:50.2831050Z else: 2025-05-07T20:32:50.2831265Z scale_ub_tensor = None 2025-05-07T20:32:50.2831524Z 2025-05-07T20:32:50.2831757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.2832075Z op = silu_mul_quant 2025-05-07T20:32:50.2832329Z if compiled: 2025-05-07T20:32:50.2832576Z op = torch.compile(op) 2025-05-07T20:32:50.2832874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2833149Z 2025-05-07T20:32:50.2833341Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.2833511Z 2025-05-07T20:32:50.2833613Z moe/activation_test.py:117: 2025-05-07T20:32:50.2833909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2834248Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.2834523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.2835082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.2835644Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.2836294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.2836978Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.2837514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.2838189Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.2838926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.2839465Z kernel = self.compile( 2025-05-07T20:32:50.2840004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.2840655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.2841062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.2841293Z 2025-05-07T20:32:50.2841498Z self = 2025-05-07T20:32:50.2842633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.2844001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f930b8ea0>} 2025-05-07T20:32:50.2845389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.2846412Z context = 2025-05-07T20:32:50.2846747Z 2025-05-07T20:32:50.2846914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.2847432Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.2847904Z module_map=module_map) 2025-05-07T20:32:50.2848322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.2848674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.2848932Z E ^ 2025-05-07T20:32:50.2849404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.2849858Z 2025-05-07T20:32:50.2850273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.2850779Z 2025-05-07T20:32:50.2850884Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.2851335Z self=, 2025-05-07T20:32:50.2851741Z T=2048, 2025-05-07T20:32:50.2851934Z D=7168, 2025-05-07T20:32:50.2852129Z scale_ub=None, 2025-05-07T20:32:50.2852347Z contiguous=True, 2025-05-07T20:32:50.2852574Z compiled=True, 2025-05-07T20:32:50.4066376Z ) 2025-05-07T20:32:50.4067510Z self = 2025-05-07T20:32:50.4069283Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.4070194Z 2025-05-07T20:32:50.4070340Z @given( 2025-05-07T20:32:50.4070787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4071381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4071951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4072574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4073198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4073738Z ) 2025-05-07T20:32:50.4074396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4075242Z def test_silu_mul_quant( 2025-05-07T20:32:50.4075693Z self, 2025-05-07T20:32:50.4076037Z T: int, 2025-05-07T20:32:50.4076392Z D: int, 2025-05-07T20:32:50.4076791Z scale_ub: Optional[float], 2025-05-07T20:32:50.4077289Z contiguous: bool, 2025-05-07T20:32:50.4077730Z compiled: bool, 2025-05-07T20:32:50.4078139Z ) -> None: 2025-05-07T20:32:50.4078527Z torch.manual_seed(2025) 2025-05-07T20:32:50.4078973Z 2025-05-07T20:32:50.4079812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4080457Z 2025-05-07T20:32:50.4080806Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4081347Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4081816Z x = x_sign * x_clamp 2025-05-07T20:32:50.4082078Z x0 = x[:, :D] 2025-05-07T20:32:50.4082298Z x1 = x[:, D:] 2025-05-07T20:32:50.4082484Z 2025-05-07T20:32:50.4082650Z if contiguous: 2025-05-07T20:32:50.4082865Z x0 = x0.contiguous() 2025-05-07T20:32:50.4083100Z x1 = x1.contiguous() 2025-05-07T20:32:50.4083320Z 2025-05-07T20:32:50.4083499Z if scale_ub is not None: 2025-05-07T20:32:50.4083750Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.4084072Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.4084496Z ) 2025-05-07T20:32:50.4084674Z else: 2025-05-07T20:32:50.4084864Z scale_ub_tensor = None 2025-05-07T20:32:50.4085103Z 2025-05-07T20:32:50.4085319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.4085614Z op = silu_mul_quant 2025-05-07T20:32:50.4085853Z if compiled: 2025-05-07T20:32:50.4086084Z op = torch.compile(op) 2025-05-07T20:32:50.4086364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4086723Z 2025-05-07T20:32:50.4086897Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.4087054Z 2025-05-07T20:32:50.4087146Z moe/activation_test.py:117: 2025-05-07T20:32:50.4087425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4087745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.4088009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.4088553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.4089098Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.4089743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.4090406Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.4090923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.4091654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.4092290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.4092799Z kernel = self.compile( 2025-05-07T20:32:50.4093331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.4093960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.4094334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.4094557Z 2025-05-07T20:32:50.4094758Z self = 2025-05-07T20:32:50.4095818Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.4097169Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f930b9c60>} 2025-05-07T20:32:50.4098478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.4099473Z context = 2025-05-07T20:32:50.4099755Z 2025-05-07T20:32:50.4099996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.4100500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.4100944Z module_map=module_map) 2025-05-07T20:32:50.4101288Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.4101628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.4101866Z E ^ 2025-05-07T20:32:50.4102306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.4102749Z 2025-05-07T20:32:50.4103158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.4103659Z 2025-05-07T20:32:50.4103754Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4104153Z self=, 2025-05-07T20:32:50.4104534Z T=16384, 2025-05-07T20:32:50.4104721Z D=5120, 2025-05-07T20:32:50.4104896Z scale_ub=None, 2025-05-07T20:32:50.4105095Z contiguous=False, 2025-05-07T20:32:50.4105304Z compiled=False, 2025-05-07T20:32:50.4105498Z ) 2025-05-07T20:32:50.4105794Z self = 2025-05-07T20:32:50.4106271Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.4106613Z 2025-05-07T20:32:50.4106690Z @given( 2025-05-07T20:32:50.4106922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4107212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4107498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4107807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4108113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4108562Z ) 2025-05-07T20:32:50.4108898Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4109320Z def test_silu_mul_quant( 2025-05-07T20:32:50.4109543Z self, 2025-05-07T20:32:50.4109724Z T: int, 2025-05-07T20:32:50.4109903Z D: int, 2025-05-07T20:32:50.4110106Z scale_ub: Optional[float], 2025-05-07T20:32:50.4110356Z contiguous: bool, 2025-05-07T20:32:50.4110657Z compiled: bool, 2025-05-07T20:32:50.4110867Z ) -> None: 2025-05-07T20:32:50.4111062Z torch.manual_seed(2025) 2025-05-07T20:32:50.4111286Z 2025-05-07T20:32:50.4111545Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4111874Z 2025-05-07T20:32:50.4112057Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4112331Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4114336Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4116248Z 2025-05-07T20:32:50.4116360Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.4116564Z 2025-05-07T20:32:50.4116667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4117060Z self=, 2025-05-07T20:32:50.4117446Z T=4096, 2025-05-07T20:32:50.4117634Z D=7168, 2025-05-07T20:32:50.4117825Z scale_ub=1200.0, 2025-05-07T20:32:50.4118047Z contiguous=True, 2025-05-07T20:32:50.4118264Z compiled=True, 2025-05-07T20:32:50.4118459Z ) 2025-05-07T20:32:50.4118882Z self = 2025-05-07T20:32:50.4119369Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.4119636Z 2025-05-07T20:32:50.4119718Z @given( 2025-05-07T20:32:50.4119931Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4120242Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4120543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4120862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4121183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4121456Z ) 2025-05-07T20:32:50.4121796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4122227Z def test_silu_mul_quant( 2025-05-07T20:32:50.4122462Z self, 2025-05-07T20:32:50.4122649Z T: int, 2025-05-07T20:32:50.4122840Z D: int, 2025-05-07T20:32:50.4123055Z scale_ub: Optional[float], 2025-05-07T20:32:50.4123318Z contiguous: bool, 2025-05-07T20:32:50.4123556Z compiled: bool, 2025-05-07T20:32:50.4123780Z ) -> None: 2025-05-07T20:32:50.4123992Z torch.manual_seed(2025) 2025-05-07T20:32:50.4124223Z 2025-05-07T20:32:50.4124558Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4124960Z 2025-05-07T20:32:50.4125140Z x_sign = torch.sign(x) 2025-05-07T20:32:50.4125422Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.4127418Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4129272Z 2025-05-07T20:32:50.4129391Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.4129598Z 2025-05-07T20:32:50.4129705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.4130147Z self=, 2025-05-07T20:32:50.4130549Z T=16384, 2025-05-07T20:32:50.4130740Z D=7168, 2025-05-07T20:32:50.4130924Z scale_ub=None, 2025-05-07T20:32:50.4131133Z contiguous=False, 2025-05-07T20:32:50.4131352Z compiled=False, 2025-05-07T20:32:50.4131555Z ) 2025-05-07T20:32:50.4131862Z self = 2025-05-07T20:32:50.4132348Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.4132619Z 2025-05-07T20:32:50.4132695Z @given( 2025-05-07T20:32:50.4132915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.4133233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.4133536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.4133851Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.4134168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.4134449Z ) 2025-05-07T20:32:50.4134796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.4135231Z def test_silu_mul_quant( 2025-05-07T20:32:50.4135466Z self, 2025-05-07T20:32:50.4135656Z T: int, 2025-05-07T20:32:50.4135852Z D: int, 2025-05-07T20:32:50.4136067Z scale_ub: Optional[float], 2025-05-07T20:32:50.4136325Z contiguous: bool, 2025-05-07T20:32:50.4136564Z compiled: bool, 2025-05-07T20:32:50.4136779Z ) -> None: 2025-05-07T20:32:50.4136983Z torch.manual_seed(2025) 2025-05-07T20:32:50.4137216Z 2025-05-07T20:32:50.4137568Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.4139608Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.4141468Z 2025-05-07T20:32:50.4141583Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.5357359Z 2025-05-07T20:32:50.5357665Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5358270Z self=, 2025-05-07T20:32:50.5358829Z T=2048, 2025-05-07T20:32:50.5359082Z D=7168, 2025-05-07T20:32:50.5359342Z scale_ub=1200.0, 2025-05-07T20:32:50.5359654Z contiguous=True, 2025-05-07T20:32:50.5359865Z compiled=True, 2025-05-07T20:32:50.5360055Z ) 2025-05-07T20:32:50.5360367Z self = 2025-05-07T20:32:50.5360848Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.5361218Z 2025-05-07T20:32:50.5361293Z @given( 2025-05-07T20:32:50.5361511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5361846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5362142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5362457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5362769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5363042Z ) 2025-05-07T20:32:50.5363377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5363809Z def test_silu_mul_quant( 2025-05-07T20:32:50.5364036Z self, 2025-05-07T20:32:50.5364219Z T: int, 2025-05-07T20:32:50.5364505Z D: int, 2025-05-07T20:32:50.5364713Z scale_ub: Optional[float], 2025-05-07T20:32:50.5364970Z contiguous: bool, 2025-05-07T20:32:50.5365274Z compiled: bool, 2025-05-07T20:32:50.5365517Z ) -> None: 2025-05-07T20:32:50.5365743Z torch.manual_seed(2025) 2025-05-07T20:32:50.5365972Z 2025-05-07T20:32:50.5366228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5366557Z 2025-05-07T20:32:50.5366738Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5367024Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5369005Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.5370849Z 2025-05-07T20:32:50.5370968Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:50.5371170Z 2025-05-07T20:32:50.5371272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5371667Z self=, 2025-05-07T20:32:50.5372056Z T=2048, 2025-05-07T20:32:50.5372231Z D=7168, 2025-05-07T20:32:50.5378498Z scale_ub=None, 2025-05-07T20:32:50.5378728Z contiguous=True, 2025-05-07T20:32:50.5378962Z compiled=False, 2025-05-07T20:32:50.5379181Z ) 2025-05-07T20:32:50.5379503Z self = 2025-05-07T20:32:50.5380184Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.5380466Z 2025-05-07T20:32:50.5380550Z @given( 2025-05-07T20:32:50.5380796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5381114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5381429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5381779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5382109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5382400Z ) 2025-05-07T20:32:50.5382760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5383206Z def test_silu_mul_quant( 2025-05-07T20:32:50.5383462Z self, 2025-05-07T20:32:50.5383666Z T: int, 2025-05-07T20:32:50.5383868Z D: int, 2025-05-07T20:32:50.5384089Z scale_ub: Optional[float], 2025-05-07T20:32:50.5384368Z contiguous: bool, 2025-05-07T20:32:50.5384613Z compiled: bool, 2025-05-07T20:32:50.5384847Z ) -> None: 2025-05-07T20:32:50.5385067Z torch.manual_seed(2025) 2025-05-07T20:32:50.5385304Z 2025-05-07T20:32:50.5385585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5385936Z 2025-05-07T20:32:50.5386180Z > x_sign = torch.sign(x) 2025-05-07T20:32:50.5388120Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.5389965Z 2025-05-07T20:32:50.5390090Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:50.5390307Z 2025-05-07T20:32:50.5390411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5390829Z self=, 2025-05-07T20:32:50.5391228Z T=1, 2025-05-07T20:32:50.5391470Z D=7168, 2025-05-07T20:32:50.5391671Z scale_ub=1200.0, 2025-05-07T20:32:50.5391903Z contiguous=True, 2025-05-07T20:32:50.5392124Z compiled=False, 2025-05-07T20:32:50.5392328Z ) 2025-05-07T20:32:50.5392643Z self = 2025-05-07T20:32:50.5393124Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.5393391Z 2025-05-07T20:32:50.5393473Z @given( 2025-05-07T20:32:50.5393713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5394021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5394337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5394667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5394988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5395281Z ) 2025-05-07T20:32:50.5395660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5396124Z def test_silu_mul_quant( 2025-05-07T20:32:50.5396369Z self, 2025-05-07T20:32:50.5396573Z T: int, 2025-05-07T20:32:50.5396778Z D: int, 2025-05-07T20:32:50.5396992Z scale_ub: Optional[float], 2025-05-07T20:32:50.5397265Z contiguous: bool, 2025-05-07T20:32:50.5397511Z compiled: bool, 2025-05-07T20:32:50.5397734Z ) -> None: 2025-05-07T20:32:50.5397953Z torch.manual_seed(2025) 2025-05-07T20:32:50.5398199Z 2025-05-07T20:32:50.5398475Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5398821Z 2025-05-07T20:32:50.5399018Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5399388Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5399701Z x = x_sign * x_clamp 2025-05-07T20:32:50.5399945Z x0 = x[:, :D] 2025-05-07T20:32:50.5400166Z x1 = x[:, D:] 2025-05-07T20:32:50.5400377Z 2025-05-07T20:32:50.5400571Z if contiguous: 2025-05-07T20:32:50.5400805Z x0 = x0.contiguous() 2025-05-07T20:32:50.5401061Z x1 = x1.contiguous() 2025-05-07T20:32:50.5401308Z 2025-05-07T20:32:50.5401507Z if scale_ub is not None: 2025-05-07T20:32:50.5401786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5402117Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5402432Z ) 2025-05-07T20:32:50.5402629Z else: 2025-05-07T20:32:50.5402836Z scale_ub_tensor = None 2025-05-07T20:32:50.5403092Z 2025-05-07T20:32:50.5403325Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5403636Z op = silu_mul_quant 2025-05-07T20:32:50.5403903Z if compiled: 2025-05-07T20:32:50.5404157Z op = torch.compile(op) 2025-05-07T20:32:50.5404616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5404894Z 2025-05-07T20:32:50.5405095Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.5405264Z 2025-05-07T20:32:50.5405417Z moe/activation_test.py:117: 2025-05-07T20:32:50.5405763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5406098Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.5406386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5407075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.5407763Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.5408582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5409277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5409940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5410474Z kernel = self.compile( 2025-05-07T20:32:50.5411100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5411756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5412151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5412379Z 2025-05-07T20:32:50.5412589Z self = 2025-05-07T20:32:50.5413678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5415045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e58b80>} 2025-05-07T20:32:50.5416438Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.5417462Z context = 2025-05-07T20:32:50.5417751Z 2025-05-07T20:32:50.5417922Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.5418439Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.5418904Z module_map=module_map) 2025-05-07T20:32:50.5419267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.5419735Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.5419995Z E ^ 2025-05-07T20:32:50.5420462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.5420906Z 2025-05-07T20:32:50.5421327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.5421839Z 2025-05-07T20:32:50.5421948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.5422357Z self=, 2025-05-07T20:32:50.5422760Z T=128, 2025-05-07T20:32:50.5422953Z D=5120, 2025-05-07T20:32:50.5423141Z scale_ub=None, 2025-05-07T20:32:50.5423359Z contiguous=True, 2025-05-07T20:32:50.5423584Z compiled=False, 2025-05-07T20:32:50.5423784Z ) 2025-05-07T20:32:50.5424102Z self = 2025-05-07T20:32:50.5424597Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.5424862Z 2025-05-07T20:32:50.5424942Z @given( 2025-05-07T20:32:50.5425171Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.5425477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.5425779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.5426159Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.5426482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.5426756Z ) 2025-05-07T20:32:50.5427089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.5427521Z def test_silu_mul_quant( 2025-05-07T20:32:50.5427759Z self, 2025-05-07T20:32:50.5427942Z T: int, 2025-05-07T20:32:50.5428133Z D: int, 2025-05-07T20:32:50.5428344Z scale_ub: Optional[float], 2025-05-07T20:32:50.5428604Z contiguous: bool, 2025-05-07T20:32:50.5428838Z compiled: bool, 2025-05-07T20:32:50.5429060Z ) -> None: 2025-05-07T20:32:50.5429263Z torch.manual_seed(2025) 2025-05-07T20:32:50.5429498Z 2025-05-07T20:32:50.5429765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.5430099Z 2025-05-07T20:32:50.5430346Z x_sign = torch.sign(x) 2025-05-07T20:32:50.5430635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.5430938Z x = x_sign * x_clamp 2025-05-07T20:32:50.5431166Z x0 = x[:, :D] 2025-05-07T20:32:50.5431376Z x1 = x[:, D:] 2025-05-07T20:32:50.5431575Z 2025-05-07T20:32:50.5431754Z if contiguous: 2025-05-07T20:32:50.5431981Z x0 = x0.contiguous() 2025-05-07T20:32:50.5432235Z x1 = x1.contiguous() 2025-05-07T20:32:50.5432465Z 2025-05-07T20:32:50.5432651Z if scale_ub is not None: 2025-05-07T20:32:50.5432917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.5433243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.5433544Z ) 2025-05-07T20:32:50.5433729Z else: 2025-05-07T20:32:50.5433932Z scale_ub_tensor = None 2025-05-07T20:32:50.5434177Z 2025-05-07T20:32:50.5434401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.5434714Z op = silu_mul_quant 2025-05-07T20:32:50.5434957Z if compiled: 2025-05-07T20:32:50.5435197Z op = torch.compile(op) 2025-05-07T20:32:50.5435486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5435747Z 2025-05-07T20:32:50.5435930Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.5436090Z 2025-05-07T20:32:50.5436190Z moe/activation_test.py:117: 2025-05-07T20:32:50.5436476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5436800Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.5437075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.5437829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.5438505Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.5439026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.5439698Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.5440345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.5440861Z kernel = self.compile( 2025-05-07T20:32:50.5441391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.5442093Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.5442478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.5442705Z 2025-05-07T20:32:50.5442914Z self = 2025-05-07T20:32:50.5443979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.5445491Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e59a80>} 2025-05-07T20:32:50.5446814Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.5447874Z context = 2025-05-07T20:32:50.5448162Z 2025-05-07T20:32:50.5448324Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.5448840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.5449294Z module_map=module_map) 2025-05-07T20:32:50.5449649Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.5450043Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.5450293Z E ^ 2025-05-07T20:32:50.5450746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.5451195Z 2025-05-07T20:32:50.5451604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.6573574Z 2025-05-07T20:32:50.6573820Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6574533Z self=, 2025-05-07T20:32:50.6575262Z T=128, 2025-05-07T20:32:50.6575530Z D=7168, 2025-05-07T20:32:50.6575828Z scale_ub=None, 2025-05-07T20:32:50.6576089Z contiguous=True, 2025-05-07T20:32:50.6576298Z compiled=False, 2025-05-07T20:32:50.6576491Z ) 2025-05-07T20:32:50.6576798Z self = 2025-05-07T20:32:50.6577281Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.6577541Z 2025-05-07T20:32:50.6577614Z @given( 2025-05-07T20:32:50.6577837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6578143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6578440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6578757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6579073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6579345Z ) 2025-05-07T20:32:50.6579675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6580276Z def test_silu_mul_quant( 2025-05-07T20:32:50.6580513Z self, 2025-05-07T20:32:50.6580699Z T: int, 2025-05-07T20:32:50.6580885Z D: int, 2025-05-07T20:32:50.6581088Z scale_ub: Optional[float], 2025-05-07T20:32:50.6581340Z contiguous: bool, 2025-05-07T20:32:50.6581569Z compiled: bool, 2025-05-07T20:32:50.6581783Z ) -> None: 2025-05-07T20:32:50.6581981Z torch.manual_seed(2025) 2025-05-07T20:32:50.6582213Z 2025-05-07T20:32:50.6582478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6582805Z 2025-05-07T20:32:50.6582981Z x_sign = torch.sign(x) 2025-05-07T20:32:50.6583261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.6583548Z x = x_sign * x_clamp 2025-05-07T20:32:50.6583775Z x0 = x[:, :D] 2025-05-07T20:32:50.6583976Z x1 = x[:, D:] 2025-05-07T20:32:50.6584168Z 2025-05-07T20:32:50.6584338Z if contiguous: 2025-05-07T20:32:50.6584561Z x0 = x0.contiguous() 2025-05-07T20:32:50.6584814Z x1 = x1.contiguous() 2025-05-07T20:32:50.6585050Z 2025-05-07T20:32:50.6585227Z if scale_ub is not None: 2025-05-07T20:32:50.6585487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.6585816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.6586178Z ) 2025-05-07T20:32:50.6586367Z else: 2025-05-07T20:32:50.6586557Z scale_ub_tensor = None 2025-05-07T20:32:50.6586794Z 2025-05-07T20:32:50.6587013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.6587311Z op = silu_mul_quant 2025-05-07T20:32:50.6587551Z if compiled: 2025-05-07T20:32:50.6587785Z op = torch.compile(op) 2025-05-07T20:32:50.6588065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6588328Z 2025-05-07T20:32:50.6588509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.6588668Z 2025-05-07T20:32:50.6588767Z moe/activation_test.py:117: 2025-05-07T20:32:50.6589052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6589371Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.6589642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6590381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.6591047Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.6591567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.6592232Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.6592878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.6593387Z kernel = self.compile( 2025-05-07T20:32:50.6593921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.6594564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.6594953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6595189Z 2025-05-07T20:32:50.6595392Z self = 2025-05-07T20:32:50.6596447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.6597799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5a980>} 2025-05-07T20:32:50.6599211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.6600220Z context = 2025-05-07T20:32:50.6600505Z 2025-05-07T20:32:50.6600664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.6601183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.6601639Z module_map=module_map) 2025-05-07T20:32:50.6601998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.6602333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.6602580Z E ^ 2025-05-07T20:32:50.6603030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.6603472Z 2025-05-07T20:32:50.6603886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.6604491Z 2025-05-07T20:32:50.6604597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6604996Z self=, 2025-05-07T20:32:50.6605387Z T=2048, 2025-05-07T20:32:50.6605572Z D=7168, 2025-05-07T20:32:50.6605803Z scale_ub=1200.0, 2025-05-07T20:32:50.6606008Z contiguous=True, 2025-05-07T20:32:50.6606220Z compiled=False, 2025-05-07T20:32:50.6606412Z ) 2025-05-07T20:32:50.6606716Z self = 2025-05-07T20:32:50.6607196Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.6607460Z 2025-05-07T20:32:50.6607537Z @given( 2025-05-07T20:32:50.6607757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6608061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6608537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6608850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6609168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6609444Z ) 2025-05-07T20:32:50.6609780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6610287Z def test_silu_mul_quant( 2025-05-07T20:32:50.6610519Z self, 2025-05-07T20:32:50.6610703Z T: int, 2025-05-07T20:32:50.6610883Z D: int, 2025-05-07T20:32:50.6611095Z scale_ub: Optional[float], 2025-05-07T20:32:50.6611354Z contiguous: bool, 2025-05-07T20:32:50.6611581Z compiled: bool, 2025-05-07T20:32:50.6611799Z ) -> None: 2025-05-07T20:32:50.6612004Z torch.manual_seed(2025) 2025-05-07T20:32:50.6612233Z 2025-05-07T20:32:50.6612497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6614521Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
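The CompilationError above repeats for every example that reaches the Triton kernel: fp8e4nv corresponds to torch.float8_e4m3fn, and Triton only accepts it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older parts the only fp8 formats are the fp8e4b15 and fp8e5 named in the error. A minimal probe, assuming only public torch APIs (torch.cuda.get_device_capability is standard; the (8, 9) threshold is the usual Triton cutoff), shows why ast_to_ttir rejects the kernel here:

    # Probe the device for fp8e4nv support; on this runner the capability
    # check is expected to come back False, matching the ValueError above.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: {major}.{minor}")
    print("fp8e4nv (float8_e4m3fn) usable:", (major, minor) >= (8, 9))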
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.6616360Z 2025-05-07T20:32:50.6616474Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.6616678Z 2025-05-07T20:32:50.6616776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.6617170Z self=, 2025-05-07T20:32:50.6617559Z T=1, 2025-05-07T20:32:50.6617731Z D=5120, 2025-05-07T20:32:50.6617907Z scale_ub=1200.0, 2025-05-07T20:32:50.6618121Z contiguous=True, 2025-05-07T20:32:50.6618503Z compiled=False, 2025-05-07T20:32:50.6618697Z ) 2025-05-07T20:32:50.6618999Z self = 2025-05-07T20:32:50.6619466Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.6619720Z 2025-05-07T20:32:50.6619799Z @given( 2025-05-07T20:32:50.6620018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.6620318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.6620614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.6620930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.6621252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.6621524Z ) 2025-05-07T20:32:50.6621857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.6622282Z def test_silu_mul_quant( 2025-05-07T20:32:50.6622513Z self, 2025-05-07T20:32:50.6622696Z T: int, 2025-05-07T20:32:50.6622894Z D: int, 2025-05-07T20:32:50.6623104Z scale_ub: Optional[float], 2025-05-07T20:32:50.6623364Z contiguous: bool, 2025-05-07T20:32:50.6623591Z compiled: bool, 2025-05-07T20:32:50.6623800Z ) -> None: 2025-05-07T20:32:50.6623997Z torch.manual_seed(2025) 2025-05-07T20:32:50.6624292Z 2025-05-07T20:32:50.6624553Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.6624880Z 2025-05-07T20:32:50.6625057Z x_sign = torch.sign(x) 2025-05-07T20:32:50.6625344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.6625640Z x = x_sign * x_clamp 2025-05-07T20:32:50.6625866Z x0 = x[:, :D] 2025-05-07T20:32:50.6626070Z x1 = x[:, D:] 2025-05-07T20:32:50.6626265Z 2025-05-07T20:32:50.6626437Z if contiguous: 2025-05-07T20:32:50.6626659Z x0 = x0.contiguous() 2025-05-07T20:32:50.6626905Z x1 = x1.contiguous() 2025-05-07T20:32:50.6627127Z 2025-05-07T20:32:50.6627316Z if scale_ub is not None: 2025-05-07T20:32:50.6627578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.6627899Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.6628197Z ) 2025-05-07T20:32:50.6628429Z else: 2025-05-07T20:32:50.6628638Z scale_ub_tensor = None 2025-05-07T20:32:50.6628876Z 2025-05-07T20:32:50.6629093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.6629399Z op = silu_mul_quant 2025-05-07T20:32:50.6629633Z if compiled: 2025-05-07T20:32:50.6629870Z op = torch.compile(op) 2025-05-07T20:32:50.6630154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6630411Z 2025-05-07T20:32:50.6630594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.6630752Z 2025-05-07T20:32:50.6630852Z moe/activation_test.py:117: 2025-05-07T20:32:50.6631140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6631462Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.6631734Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.6632405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.6633077Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.6633602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.6634269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.6634912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.6635428Z kernel = self.compile( 2025-05-07T20:32:50.6635955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.6636700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.6637082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.6637307Z 2025-05-07T20:32:50.6637506Z self = 2025-05-07T20:32:50.6638570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.6639929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5be20>} 2025-05-07T20:32:50.6641251Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.6642267Z context = 2025-05-07T20:32:50.6642545Z 2025-05-07T20:32:50.6642708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.6643216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.6643710Z module_map=module_map) 2025-05-07T20:32:50.6650326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.6650713Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.6650974Z E ^ 2025-05-07T20:32:50.6651452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.6651903Z 2025-05-07T20:32:50.6652324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7455831Z 2025-05-07T20:32:50.7456400Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7457795Z self=, 2025-05-07T20:32:50.7458930Z T=2048, 2025-05-07T20:32:50.7459274Z D=5120, 2025-05-07T20:32:50.7459628Z scale_ub=None, 2025-05-07T20:32:50.7460273Z contiguous=True, 2025-05-07T20:32:50.7460745Z compiled=False, 2025-05-07T20:32:50.7461121Z ) 2025-05-07T20:32:50.7461720Z self = 2025-05-07T20:32:50.7462674Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7463197Z 2025-05-07T20:32:50.7463341Z @given( 2025-05-07T20:32:50.7463762Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7464360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7464939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7465461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7465774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7466045Z ) 2025-05-07T20:32:50.7466386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7466815Z def test_silu_mul_quant( 2025-05-07T20:32:50.7467044Z self, 2025-05-07T20:32:50.7467225Z T: int, 2025-05-07T20:32:50.7467403Z D: int, 2025-05-07T20:32:50.7467607Z scale_ub: Optional[float], 2025-05-07T20:32:50.7467864Z contiguous: bool, 2025-05-07T20:32:50.7468087Z compiled: bool, 2025-05-07T20:32:50.7468294Z ) -> None: 2025-05-07T20:32:50.7468503Z torch.manual_seed(2025) 2025-05-07T20:32:50.7468732Z 2025-05-07T20:32:50.7469001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7469325Z 2025-05-07T20:32:50.7469500Z > x_sign = torch.sign(x) 2025-05-07T20:32:50.7471566Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
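The "Tried to allocate" figures track the test's own shapes exactly: x is a [T, 2 * D] bfloat16 tensor, so each fresh allocation is T * (2 * D) * 2 bytes, and torch.sign(x) materializes a second tensor of the same size. A quick check against the two sizes reported above:

    # bfloat16 is 2 bytes/element; the products match the log's MiB figures.
    MiB = 2 ** 20
    assert 2048 * (2 * 7168) * 2 == 56 * MiB   # randn for T=2048, D=7168
    assert 2048 * (2 * 5120) * 2 == 40 * MiB   # sign(x) for T=2048, D=5120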
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7473436Z 2025-05-07T20:32:50.7473546Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:50.7473760Z 2025-05-07T20:32:50.7473860Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7474264Z self=, 2025-05-07T20:32:50.7474651Z T=16384, 2025-05-07T20:32:50.7474831Z D=5120, 2025-05-07T20:32:50.7475003Z scale_ub=None, 2025-05-07T20:32:50.7475200Z contiguous=True, 2025-05-07T20:32:50.7475405Z compiled=False, 2025-05-07T20:32:50.7475599Z ) 2025-05-07T20:32:50.7475909Z self = 2025-05-07T20:32:50.7476391Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7476668Z 2025-05-07T20:32:50.7476737Z @given( 2025-05-07T20:32:50.7476961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7477324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7477615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7477930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7478242Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7478511Z ) 2025-05-07T20:32:50.7478848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7479278Z def test_silu_mul_quant( 2025-05-07T20:32:50.7479501Z self, 2025-05-07T20:32:50.7479677Z T: int, 2025-05-07T20:32:50.7479859Z D: int, 2025-05-07T20:32:50.7480065Z scale_ub: Optional[float], 2025-05-07T20:32:50.7480329Z contiguous: bool, 2025-05-07T20:32:50.7480555Z compiled: bool, 2025-05-07T20:32:50.7480768Z ) -> None: 2025-05-07T20:32:50.7480964Z torch.manual_seed(2025) 2025-05-07T20:32:50.7481238Z 2025-05-07T20:32:50.7481498Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7483532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7485497Z 2025-05-07T20:32:50.7485607Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7485814Z 2025-05-07T20:32:50.7485908Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7486303Z self=, 2025-05-07T20:32:50.7486689Z T=4096, 2025-05-07T20:32:50.7486856Z D=5120, 2025-05-07T20:32:50.7487029Z scale_ub=None, 2025-05-07T20:32:50.7487236Z contiguous=True, 2025-05-07T20:32:50.7487436Z compiled=False, 2025-05-07T20:32:50.7487623Z ) 2025-05-07T20:32:50.7487955Z self = 2025-05-07T20:32:50.7488448Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.7488711Z 2025-05-07T20:32:50.7488779Z @given( 2025-05-07T20:32:50.7488987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7489280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7489651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7489965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7490277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7490538Z ) 2025-05-07T20:32:50.7490870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7491300Z def test_silu_mul_quant( 2025-05-07T20:32:50.7491523Z self, 2025-05-07T20:32:50.7491699Z T: int, 2025-05-07T20:32:50.7491877Z D: int, 2025-05-07T20:32:50.7492072Z scale_ub: Optional[float], 2025-05-07T20:32:50.7492323Z contiguous: bool, 2025-05-07T20:32:50.7492548Z compiled: bool, 2025-05-07T20:32:50.7492747Z ) -> None: 2025-05-07T20:32:50.7492945Z torch.manual_seed(2025) 2025-05-07T20:32:50.7493170Z 2025-05-07T20:32:50.7493418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7495431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7497394Z 2025-05-07T20:32:50.7497504Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7497709Z 2025-05-07T20:32:50.7497804Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7498200Z self=, 2025-05-07T20:32:50.7498580Z T=2048, 2025-05-07T20:32:50.7498749Z D=5120, 2025-05-07T20:32:50.7498929Z scale_ub=None, 2025-05-07T20:32:50.7499128Z contiguous=False, 2025-05-07T20:32:50.7499345Z compiled=False, 2025-05-07T20:32:50.7499536Z ) 2025-05-07T20:32:50.7499838Z self = 2025-05-07T20:32:50.7500307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.7500618Z 2025-05-07T20:32:50.7500690Z @given( 2025-05-07T20:32:50.7500905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7501198Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7501486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7501796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7502105Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7502376Z ) 2025-05-07T20:32:50.7502703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7503128Z def test_silu_mul_quant( 2025-05-07T20:32:50.7503351Z self, 2025-05-07T20:32:50.7503537Z T: int, 2025-05-07T20:32:50.7503715Z D: int, 2025-05-07T20:32:50.7503914Z scale_ub: Optional[float], 2025-05-07T20:32:50.7504174Z contiguous: bool, 2025-05-07T20:32:50.7504398Z compiled: bool, 2025-05-07T20:32:50.7504600Z ) -> None: 2025-05-07T20:32:50.7504802Z torch.manual_seed(2025) 2025-05-07T20:32:50.7505029Z 2025-05-07T20:32:50.7505283Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7507365Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7509372Z 2025-05-07T20:32:50.7509486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7509701Z 2025-05-07T20:32:50.7509795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7510194Z self=, 2025-05-07T20:32:50.7510577Z T=4096, 2025-05-07T20:32:50.7510751Z D=7168, 2025-05-07T20:32:50.7510931Z scale_ub=None, 2025-05-07T20:32:50.7511125Z contiguous=True, 2025-05-07T20:32:50.7511335Z compiled=True, 2025-05-07T20:32:50.7511520Z ) 2025-05-07T20:32:50.7511820Z self = 2025-05-07T20:32:50.7512293Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:50.7512547Z 2025-05-07T20:32:50.7512624Z @given( 2025-05-07T20:32:50.7512841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7513141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7513434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7513746Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7514056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7514331Z ) 2025-05-07T20:32:50.7514740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7515160Z def test_silu_mul_quant( 2025-05-07T20:32:50.7515390Z self, 2025-05-07T20:32:50.7515569Z T: int, 2025-05-07T20:32:50.7515744Z D: int, 2025-05-07T20:32:50.7515950Z scale_ub: Optional[float], 2025-05-07T20:32:50.7516207Z contiguous: bool, 2025-05-07T20:32:50.7516428Z compiled: bool, 2025-05-07T20:32:50.7516633Z ) -> None: 2025-05-07T20:32:50.7516835Z torch.manual_seed(2025) 2025-05-07T20:32:50.7517053Z 2025-05-07T20:32:50.7517314Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7519376Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.7521288Z 2025-05-07T20:32:50.7521397Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.7521599Z 2025-05-07T20:32:50.7521696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7522084Z self=, 2025-05-07T20:32:50.7522467Z T=2048, 2025-05-07T20:32:50.7522635Z D=5120, 2025-05-07T20:32:50.7522815Z scale_ub=1200.0, 2025-05-07T20:32:50.7523022Z contiguous=False, 2025-05-07T20:32:50.7523234Z compiled=False, 2025-05-07T20:32:50.8057824Z ) 2025-05-07T20:32:50.8058460Z self = 2025-05-07T20:32:50.8059415Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.8059857Z 2025-05-07T20:32:50.8059929Z @given( 2025-05-07T20:32:50.8060151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8060453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8060743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8061060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8061380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8061651Z ) 2025-05-07T20:32:50.8061988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8062610Z def test_silu_mul_quant( 2025-05-07T20:32:50.8062845Z self, 2025-05-07T20:32:50.8063027Z T: int, 2025-05-07T20:32:50.8063212Z D: int, 2025-05-07T20:32:50.8063424Z scale_ub: Optional[float], 2025-05-07T20:32:50.8063679Z contiguous: bool, 2025-05-07T20:32:50.8063910Z compiled: bool, 2025-05-07T20:32:50.8064127Z ) -> None: 2025-05-07T20:32:50.8064328Z torch.manual_seed(2025) 2025-05-07T20:32:50.8064563Z 2025-05-07T20:32:50.8064825Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8066902Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8068754Z 2025-05-07T20:32:50.8068867Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8069072Z 2025-05-07T20:32:50.8069170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8069635Z self=, 2025-05-07T20:32:50.8070016Z T=4096, 2025-05-07T20:32:50.8070185Z D=7168, 2025-05-07T20:32:50.8070360Z scale_ub=1200.0, 2025-05-07T20:32:50.8070575Z contiguous=True, 2025-05-07T20:32:50.8070778Z compiled=False, 2025-05-07T20:32:50.8070964Z ) 2025-05-07T20:32:50.8071266Z self = 2025-05-07T20:32:50.8071737Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8072001Z 2025-05-07T20:32:50.8072073Z @given( 2025-05-07T20:32:50.8072286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8072588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8072879Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8073195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8073580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8073853Z ) 2025-05-07T20:32:50.8074191Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8074616Z def test_silu_mul_quant( 2025-05-07T20:32:50.8074840Z self, 2025-05-07T20:32:50.8075029Z T: int, 2025-05-07T20:32:50.8075228Z D: int, 2025-05-07T20:32:50.8075436Z scale_ub: Optional[float], 2025-05-07T20:32:50.8075700Z contiguous: bool, 2025-05-07T20:32:50.8075936Z compiled: bool, 2025-05-07T20:32:50.8076150Z ) -> None: 2025-05-07T20:32:50.8076360Z torch.manual_seed(2025) 2025-05-07T20:32:50.8076595Z 2025-05-07T20:32:50.8076860Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8078884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8080742Z 2025-05-07T20:32:50.8080856Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8081070Z 2025-05-07T20:32:50.8081171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8081575Z self=, 2025-05-07T20:32:50.8082548Z T=16384, 2025-05-07T20:32:50.8082754Z D=7168, 2025-05-07T20:32:50.8082940Z scale_ub=None, 2025-05-07T20:32:50.8083145Z contiguous=False, 2025-05-07T20:32:50.8083370Z compiled=True, 2025-05-07T20:32:50.8083563Z ) 2025-05-07T20:32:50.8083870Z self = 2025-05-07T20:32:50.8084473Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.8084755Z 2025-05-07T20:32:50.8084829Z @given( 2025-05-07T20:32:50.8085047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8085349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8085645Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8085961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8086280Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8086558Z ) 2025-05-07T20:32:50.8086899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8087328Z def test_silu_mul_quant( 2025-05-07T20:32:50.8087566Z self, 2025-05-07T20:32:50.8087747Z T: int, 2025-05-07T20:32:50.8087933Z D: int, 2025-05-07T20:32:50.8088150Z scale_ub: Optional[float], 2025-05-07T20:32:50.8088415Z contiguous: bool, 2025-05-07T20:32:50.8088693Z compiled: bool, 2025-05-07T20:32:50.8088903Z ) -> None: 2025-05-07T20:32:50.8089113Z torch.manual_seed(2025) 2025-05-07T20:32:50.8089350Z 2025-05-07T20:32:50.8089623Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8091637Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8093477Z 2025-05-07T20:32:50.8093639Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8093847Z 2025-05-07T20:32:50.8093947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8094347Z self=, 2025-05-07T20:32:50.8094743Z T=4096, 2025-05-07T20:32:50.8094929Z D=7168, 2025-05-07T20:32:50.8095114Z scale_ub=None, 2025-05-07T20:32:50.8095328Z contiguous=True, 2025-05-07T20:32:50.8095542Z compiled=False, 2025-05-07T20:32:50.8095736Z ) 2025-05-07T20:32:50.8096043Z self = 2025-05-07T20:32:50.8096521Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.8096785Z 2025-05-07T20:32:50.8096865Z @given( 2025-05-07T20:32:50.8097081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8097382Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8097684Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8098003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8098328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8098605Z ) 2025-05-07T20:32:50.8098939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8099365Z def test_silu_mul_quant( 2025-05-07T20:32:50.8099597Z self, 2025-05-07T20:32:50.8099781Z T: int, 2025-05-07T20:32:50.8099968Z D: int, 2025-05-07T20:32:50.8100178Z scale_ub: Optional[float], 2025-05-07T20:32:50.8100439Z contiguous: bool, 2025-05-07T20:32:50.8100664Z compiled: bool, 2025-05-07T20:32:50.8100875Z ) -> None: 2025-05-07T20:32:50.8101167Z torch.manual_seed(2025) 2025-05-07T20:32:50.8101400Z 2025-05-07T20:32:50.8101663Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8103679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8105517Z 2025-05-07T20:32:50.8105634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8105838Z 2025-05-07T20:32:50.8105937Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8106342Z self=, 2025-05-07T20:32:50.8106732Z T=16384, 2025-05-07T20:32:50.8106916Z D=7168, 2025-05-07T20:32:50.8107125Z scale_ub=None, 2025-05-07T20:32:50.8107353Z contiguous=True, 2025-05-07T20:32:50.8107573Z compiled=False, 2025-05-07T20:32:50.8107766Z ) 2025-05-07T20:32:50.8108119Z self = 2025-05-07T20:32:50.8108780Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.8109052Z 2025-05-07T20:32:50.8109127Z @given( 2025-05-07T20:32:50.8109347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8109647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8109943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8110255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8110570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8110856Z ) 2025-05-07T20:32:50.8111195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8111624Z def test_silu_mul_quant( 2025-05-07T20:32:50.8111855Z self, 2025-05-07T20:32:50.8112036Z T: int, 2025-05-07T20:32:50.8112302Z D: int, 2025-05-07T20:32:50.8112518Z scale_ub: Optional[float], 2025-05-07T20:32:50.8112774Z contiguous: bool, 2025-05-07T20:32:50.8113014Z compiled: bool, 2025-05-07T20:32:50.8113234Z ) -> None: 2025-05-07T20:32:50.8113437Z torch.manual_seed(2025) 2025-05-07T20:32:50.8113668Z 2025-05-07T20:32:50.8113937Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8115953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8117797Z 2025-05-07T20:32:50.8117919Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.8118128Z 2025-05-07T20:32:50.8118227Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8118637Z self=, 2025-05-07T20:32:50.8119036Z T=16384, 2025-05-07T20:32:50.8119224Z D=7168, 2025-05-07T20:32:50.8119408Z scale_ub=1200.0, 2025-05-07T20:32:50.8119627Z contiguous=True, 2025-05-07T20:32:50.8119841Z compiled=False, 2025-05-07T20:32:50.8120047Z ) 2025-05-07T20:32:50.8120361Z self = 2025-05-07T20:32:50.8120969Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8121247Z 2025-05-07T20:32:50.8121327Z @given( 2025-05-07T20:32:50.8121554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8121860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8122158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8122479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8122805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8123077Z ) 2025-05-07T20:32:50.8123419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8123856Z def test_silu_mul_quant( 2025-05-07T20:32:50.8124098Z self, 2025-05-07T20:32:50.8124375Z T: int, 2025-05-07T20:32:50.8124569Z D: int, 2025-05-07T20:32:50.8124786Z scale_ub: Optional[float], 2025-05-07T20:32:50.8125047Z contiguous: bool, 2025-05-07T20:32:50.8125298Z compiled: bool, 2025-05-07T20:32:50.8125518Z ) -> None: 2025-05-07T20:32:50.8125725Z torch.manual_seed(2025) 2025-05-07T20:32:50.8125966Z 2025-05-07T20:32:50.8126237Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8128269Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
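From this point on, every example dies in its very first allocation while free memory sits at 26.44 MiB and roughly 21.7 GiB stays allocated, i.e. tensors from earlier examples are still alive in the caching allocator rather than any single example being too large. One hedged mitigation (the placement is an assumption; the APIs are standard) is to drop dead references and return cached blocks between Hypothesis examples:

    # Release cached CUDA memory between examples. gc.collect() clears
    # dead Python-level references; empty_cache() hands cached blocks back
    # to the driver so the next example starts from a clean pool.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()
        torch.cuda.empty_cache()

Calling this from the test class's tearDown (which Hypothesis runs once per example for unittest-style tests) would keep one failing example from starving all the ones after it.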
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.8130189Z 2025-05-07T20:32:50.8130298Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.9914330Z 2025-05-07T20:32:50.9914536Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9915167Z self=, 2025-05-07T20:32:50.9915743Z T=128, 2025-05-07T20:32:50.9915985Z D=5120, 2025-05-07T20:32:50.9916255Z scale_ub=1200.0, 2025-05-07T20:32:50.9916752Z contiguous=False, 2025-05-07T20:32:50.9917068Z compiled=False, 2025-05-07T20:32:50.9917347Z ) 2025-05-07T20:32:50.9917775Z self = 2025-05-07T20:32:50.9918256Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9918523Z 2025-05-07T20:32:50.9918596Z @given( 2025-05-07T20:32:50.9918807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9919115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9919412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9919730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9920057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9920331Z ) 2025-05-07T20:32:50.9920665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9921093Z def test_silu_mul_quant( 2025-05-07T20:32:50.9921416Z self, 2025-05-07T20:32:50.9921748Z T: int, 2025-05-07T20:32:50.9922116Z D: int, 2025-05-07T20:32:50.9922500Z scale_ub: Optional[float], 2025-05-07T20:32:50.9932347Z contiguous: bool, 2025-05-07T20:32:50.9932627Z compiled: bool, 2025-05-07T20:32:50.9932864Z ) -> None: 2025-05-07T20:32:50.9933092Z torch.manual_seed(2025) 2025-05-07T20:32:50.9933336Z 2025-05-07T20:32:50.9933611Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9933956Z 2025-05-07T20:32:50.9934139Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9934446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9934925Z x = x_sign * x_clamp 2025-05-07T20:32:50.9935167Z x0 = x[:, :D] 2025-05-07T20:32:50.9935390Z x1 = x[:, D:] 2025-05-07T20:32:50.9935598Z 2025-05-07T20:32:50.9935780Z if contiguous: 2025-05-07T20:32:50.9936013Z x0 = x0.contiguous() 2025-05-07T20:32:50.9936278Z x1 = x1.contiguous() 2025-05-07T20:32:50.9936523Z 2025-05-07T20:32:50.9936711Z if scale_ub is not None: 2025-05-07T20:32:50.9936988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9937323Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9937625Z ) 2025-05-07T20:32:50.9937817Z else: 2025-05-07T20:32:50.9938035Z scale_ub_tensor = None 2025-05-07T20:32:50.9938279Z 2025-05-07T20:32:50.9938505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9938825Z op = silu_mul_quant 2025-05-07T20:32:50.9939072Z if compiled: 2025-05-07T20:32:50.9939327Z op = torch.compile(op) 2025-05-07T20:32:50.9939626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9939892Z 2025-05-07T20:32:50.9940084Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9940255Z 2025-05-07T20:32:50.9940356Z moe/activation_test.py:117: 2025-05-07T20:32:50.9940659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9941058Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9941346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9942048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9942737Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9943271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9943956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9944629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9945156Z kernel = self.compile( 2025-05-07T20:32:50.9945718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9946427Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9946817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9947048Z 2025-05-07T20:32:50.9947251Z self = 2025-05-07T20:32:50.9948325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9949695Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92ce0ae0>} 2025-05-07T20:32:50.9951028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9952046Z context = 2025-05-07T20:32:50.9952344Z 2025-05-07T20:32:50.9952511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9953038Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9953512Z module_map=module_map) 2025-05-07T20:32:50.9953874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9954221Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9954486Z E ^ 2025-05-07T20:32:50.9955034Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9955484Z 2025-05-07T20:32:50.9955897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9956468Z 2025-05-07T20:32:50.9956579Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9956991Z self=, 2025-05-07T20:32:50.9957390Z T=2048, 2025-05-07T20:32:50.9957574Z D=7168, 2025-05-07T20:32:50.9957766Z scale_ub=None, 2025-05-07T20:32:50.9957982Z contiguous=False, 2025-05-07T20:32:50.9958209Z compiled=False, 2025-05-07T20:32:50.9958413Z ) 2025-05-07T20:32:50.9958729Z self = 2025-05-07T20:32:50.9959216Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.9959487Z 2025-05-07T20:32:50.9959569Z @given( 2025-05-07T20:32:50.9959799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9960113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9960416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9960738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9961118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9961395Z ) 2025-05-07T20:32:50.9961739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9962173Z def test_silu_mul_quant( 2025-05-07T20:32:50.9962405Z self, 2025-05-07T20:32:50.9962603Z T: int, 2025-05-07T20:32:50.9962793Z D: int, 2025-05-07T20:32:50.9963001Z scale_ub: Optional[float], 2025-05-07T20:32:50.9963269Z contiguous: bool, 2025-05-07T20:32:50.9963512Z compiled: bool, 2025-05-07T20:32:50.9963728Z ) -> None: 2025-05-07T20:32:50.9963940Z torch.manual_seed(2025) 2025-05-07T20:32:50.9964189Z 2025-05-07T20:32:50.9964538Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9966569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:50.9968460Z 2025-05-07T20:32:50.9968576Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:50.9968790Z 2025-05-07T20:32:50.9968889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9969300Z self=, 2025-05-07T20:32:50.9969691Z T=128, 2025-05-07T20:32:50.9969876Z D=7168, 2025-05-07T20:32:50.9970062Z scale_ub=1200.0, 2025-05-07T20:32:50.9970274Z contiguous=True, 2025-05-07T20:32:50.9970490Z compiled=True, 2025-05-07T20:32:50.9970691Z ) 2025-05-07T20:32:50.9971010Z self = 2025-05-07T20:32:50.9971490Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.9971756Z 2025-05-07T20:32:50.9971831Z @given( 2025-05-07T20:32:50.9972078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9972406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9972706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9973028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9973352Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9973635Z ) 2025-05-07T20:32:50.9974061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9974498Z def test_silu_mul_quant( 2025-05-07T20:32:50.9974733Z self, 2025-05-07T20:32:50.9974923Z T: int, 2025-05-07T20:32:50.9975116Z D: int, 2025-05-07T20:32:50.9975329Z scale_ub: Optional[float], 2025-05-07T20:32:50.9975593Z contiguous: bool, 2025-05-07T20:32:50.9975826Z compiled: bool, 2025-05-07T20:32:50.9976037Z ) -> None: 2025-05-07T20:32:50.9976252Z torch.manual_seed(2025) 2025-05-07T20:32:50.9976491Z 2025-05-07T20:32:50.9976753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9977086Z 2025-05-07T20:32:50.9977278Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9977557Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9977860Z x = x_sign * x_clamp 2025-05-07T20:32:50.9978097Z x0 = x[:, :D] 2025-05-07T20:32:50.9978303Z x1 = x[:, D:] 2025-05-07T20:32:50.9978512Z 2025-05-07T20:32:50.9978691Z if contiguous: 2025-05-07T20:32:50.9978917Z x0 = x0.contiguous() 2025-05-07T20:32:50.9979168Z x1 = x1.contiguous() 2025-05-07T20:32:50.9979407Z 2025-05-07T20:32:50.9979592Z if scale_ub is not None: 2025-05-07T20:32:50.9979861Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9980238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9980539Z ) 2025-05-07T20:32:50.9980728Z else: 2025-05-07T20:32:50.9980937Z scale_ub_tensor = None 2025-05-07T20:32:50.9981186Z 2025-05-07T20:32:50.9981406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9981717Z op = silu_mul_quant 2025-05-07T20:32:50.9981968Z if compiled: 2025-05-07T20:32:50.9982206Z op = torch.compile(op) 2025-05-07T20:32:50.9982497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9982772Z 2025-05-07T20:32:50.9982973Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9983144Z 2025-05-07T20:32:50.9983245Z moe/activation_test.py:117: 2025-05-07T20:32:50.9983546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9983925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9984210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9984766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.9985319Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.9986022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9986708Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9987237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9987923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9988575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9989108Z kernel = self.compile( 2025-05-07T20:32:50.9989643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9990302Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9990700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9990928Z 2025-05-07T20:32:50.9991134Z self = 2025-05-07T20:32:50.9992212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9993665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f928409a0>} 2025-05-07T20:32:50.9994991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9996057Z context = 2025-05-07T20:32:50.9996343Z 2025-05-07T20:32:50.9996509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9997020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9997486Z module_map=module_map) 2025-05-07T20:32:50.9997852Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9998200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9998462Z E ^ 2025-05-07T20:32:50.9998922Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9999368Z 2025-05-07T20:32:50.9999780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3367254Z 2025-05-07T20:32:51.3367568Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3368258Z self=, 2025-05-07T20:32:51.3368893Z T=128, 2025-05-07T20:32:51.3369155Z D=7168, 2025-05-07T20:32:51.3369343Z scale_ub=1200.0, 2025-05-07T20:32:51.3369555Z contiguous=True, 2025-05-07T20:32:51.3369767Z compiled=False, 2025-05-07T20:32:51.3369968Z ) 2025-05-07T20:32:51.3370273Z self = 2025-05-07T20:32:51.3370769Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3371030Z 2025-05-07T20:32:51.3371104Z @given( 2025-05-07T20:32:51.3371326Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3371622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3371911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3372360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3372671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3372942Z ) 2025-05-07T20:32:51.3373272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3373691Z def test_silu_mul_quant( 2025-05-07T20:32:51.3373921Z self, 2025-05-07T20:32:51.3374101Z T: int, 2025-05-07T20:32:51.3374279Z D: int, 2025-05-07T20:32:51.3374486Z scale_ub: Optional[float], 2025-05-07T20:32:51.3374748Z contiguous: bool, 2025-05-07T20:32:51.3374969Z compiled: bool, 2025-05-07T20:32:51.3375179Z ) -> None: 2025-05-07T20:32:51.3375382Z torch.manual_seed(2025) 2025-05-07T20:32:51.3375608Z 2025-05-07T20:32:51.3375868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3376197Z 2025-05-07T20:32:51.3376388Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3376669Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3378654Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3380490Z 2025-05-07T20:32:51.3380722Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.3380926Z 2025-05-07T20:32:51.3381028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3381418Z self=, 2025-05-07T20:32:51.3381809Z T=128, 2025-05-07T20:32:51.3381985Z D=5120, 2025-05-07T20:32:51.3382166Z scale_ub=1200.0, 2025-05-07T20:32:51.3382372Z contiguous=True, 2025-05-07T20:32:51.3382584Z compiled=True, 2025-05-07T20:32:51.3382778Z ) 2025-05-07T20:32:51.3383079Z self = 2025-05-07T20:32:51.3383553Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.3383810Z 2025-05-07T20:32:51.3383888Z @given( 2025-05-07T20:32:51.3384105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3384398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3384700Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3385005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3385317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3385590Z ) 2025-05-07T20:32:51.3385930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3386421Z def test_silu_mul_quant( 2025-05-07T20:32:51.3386646Z self, 2025-05-07T20:32:51.3386825Z T: int, 2025-05-07T20:32:51.3387004Z D: int, 2025-05-07T20:32:51.3387210Z scale_ub: Optional[float], 2025-05-07T20:32:51.3387472Z contiguous: bool, 2025-05-07T20:32:51.3387691Z compiled: bool, 2025-05-07T20:32:51.3387897Z ) -> None: 2025-05-07T20:32:51.3388099Z torch.manual_seed(2025) 2025-05-07T20:32:51.3388326Z 2025-05-07T20:32:51.3388584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3388908Z 2025-05-07T20:32:51.3389095Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3389379Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3391335Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3393280Z 2025-05-07T20:32:51.3393451Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.3393743Z 2025-05-07T20:32:51.3393896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3394475Z self=, 2025-05-07T20:32:51.3395039Z T=128, 2025-05-07T20:32:51.3395286Z D=7168, 2025-05-07T20:32:51.3395552Z scale_ub=None, 2025-05-07T20:32:51.3395847Z contiguous=True, 2025-05-07T20:32:51.3396150Z compiled=True, 2025-05-07T20:32:51.3396437Z ) 2025-05-07T20:32:51.3396862Z self = 2025-05-07T20:32:51.3397541Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3397900Z 2025-05-07T20:32:51.3398002Z @given( 2025-05-07T20:32:51.3398298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3398719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3399122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3399572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3400010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3400403Z ) 2025-05-07T20:32:51.3401002Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3401598Z def test_silu_mul_quant( 2025-05-07T20:32:51.3401906Z self, 2025-05-07T20:32:51.3402150Z T: int, 2025-05-07T20:32:51.3402408Z D: int, 2025-05-07T20:32:51.3402707Z scale_ub: Optional[float], 2025-05-07T20:32:51.3403048Z contiguous: bool, 2025-05-07T20:32:51.3403372Z compiled: bool, 2025-05-07T20:32:51.3403662Z ) -> None: 2025-05-07T20:32:51.3403935Z torch.manual_seed(2025) 2025-05-07T20:32:51.3404405Z 2025-05-07T20:32:51.3404773Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3407611Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
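By now free memory has fallen from 26.44 MiB to 4.44 MiB, and the failure point has crept from the initial torch.randn (activation_test.py:92) into torch.clamp (line 95); that is the cascade worsening, not a new bug. The allocator hint printed in every message is worth trying, though the small "reserved but unallocated" figures here (3 to 6 MiB) suggest live allocations, not fragmentation, are the dominant problem. If set, it must happen before CUDA initializes:

    # PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
    # so either export it in the CI job's environment or set it ahead of
    # any import that initializes torch.cuda.
    import os

    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402  (deliberately imported after env setup)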
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3410406Z 2025-05-07T20:32:51.3410578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.3410985Z 2025-05-07T20:32:51.3411542Z FAILED 2025-05-07T20:32:51.3411676Z 2025-05-07T20:32:51.3411850Z =================================== FAILURES =================================== 2025-05-07T20:32:51.3412417Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:51.3413005Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:51.3413852Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:51.3414574Z | yield 2025-05-07T20:32:51.3415150Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:51.3415850Z | self._callTestMethod(testMethod) 2025-05-07T20:32:51.3416227Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:51.3416947Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:51.3417818Z | if method() is not None: 2025-05-07T20:32:51.3418162Z | ~~~~~~^^ 2025-05-07T20:32:51.3419012Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:51.3420004Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3420390Z | ^^^^^^^ 2025-05-07T20:32:51.3421142Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:51.3421984Z | raise the_error_hypothesis_found 2025-05-07T20:32:51.3422557Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:51.3423111Z +-+---------------- 1 ---------------- 2025-05-07T20:32:51.3423481Z | Traceback (most recent call last): 2025-05-07T20:32:51.3424440Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3425486Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3428429Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3430694Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:51.3431109Z | self=, 2025-05-07T20:32:51.3431493Z | T=2048, 2025-05-07T20:32:51.3431718Z | D=5120, # or any other generated value 2025-05-07T20:32:51.3432064Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:51.3432464Z | contiguous=True, # or any other generated value 2025-05-07T20:32:51.3432873Z | compiled=False, # or any other generated value 2025-05-07T20:32:51.3433204Z | ) 2025-05-07T20:32:51.3433388Z | 2025-05-07T20:32:51.3434040Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:51.3434632Z +---------------- 2 ---------------- 2025-05-07T20:32:51.3434898Z | Traceback (most recent call last): 2025-05-07T20:32:51.3435598Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3436357Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3438381Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3440387Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:51.3465190Z | self=, 2025-05-07T20:32:51.3465778Z | T=128, 2025-05-07T20:32:51.3466079Z | D=7168, 2025-05-07T20:32:51.3466351Z | scale_ub=None, 2025-05-07T20:32:51.3466676Z | contiguous=True, 2025-05-07T20:32:51.3467014Z | compiled=True, 2025-05-07T20:32:51.3467462Z | ) 2025-05-07T20:32:51.3467716Z | 2025-05-07T20:32:51.3468452Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:51.3469299Z +---------------- 3 ---------------- 2025-05-07T20:32:51.3469685Z | Traceback (most recent call last): 2025-05-07T20:32:51.3470644Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:51.3471704Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3474517Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
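Each "Falsifying example" comes with a replay hint. Applied to this test it would look like the sketch below; only the version string and blob are copied from the log, the strategy stack mirrors the @given block above, and @reproduce_failure is meant to be deleted again once the bug is fixed:

    from typing import Optional

    from hypothesis import given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(T: int, D: int, scale_ub: Optional[float],
                            contiguous: bool, compiled: bool) -> None:
        ...  # original test body unchanged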
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |     ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |     ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
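The three OOM sub-failures all die on the same allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16), with the GPU already almost full. A minimal sketch of the mitigation the error message itself suggests, assuming the test process can be configured before CUDA is initialized; the variable name and value come straight from the message above:

```python
# Sketch only: act on the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, so it
# is exported here before torch touches the GPU.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# Releasing cached blocks between Hypothesis examples can also help,
# since each example allocates a fresh [T, 2 * D] bf16 tensor while
# blocks from earlier examples may still sit in the caching allocator.
torch.cuda.empty_cache()
```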
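Each falsifying example above comes with a reproduce_failure blob. A hedged sketch of replaying failure 1 with that blob, rebuilt from the test source printed later in this log; the import paths for silu_mul_quant and triton_quantize_fp8_row are inferred from the traceback file paths and the final assertion tolerance is illustrative, not FBGEMM's:

```python
# Sketch: replay Hypothesis failure 1 (T=2048, D=5120, scale_ub=None,
# contiguous=True, compiled=False) instead of searching for it again.
# Remove the reproduce_failure decorator once the bug is fixed.
from typing import Optional, Tuple

import torch
from hypothesis import Verbosity, given, reproduce_failure, settings
from hypothesis import strategies as st

# Import paths inferred from the traceback above -- treat as assumptions.
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant


@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from failure 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D], x[:, D:]
    if contiguous:
        x0, x1 = x0.contiguous(), x1.contiguous()
    scale_ub_tensor = (
        torch.tensor([scale_ub], device="cuda", dtype=torch.float32)
        if scale_ub is not None
        else None
    )
    op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
    y_fp8, y_scale = op(x0, x1, scale_ub_tensor)
    y = y_fp8.to(torch.float32) * y_scale[:, None]
    x0_fp32, x1_fp32 = x0.to(torch.float32), x1.to(torch.float32)
    y_ref = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    y_fp8_ref, y_scale_ref = triton_quantize_fp8_row(y_ref, scale_ub_tensor)
    y_ref_dq = y_fp8_ref.to(torch.float32) * y_scale_ref[:, None]
    # Loose tolerance chosen for illustration only.
    torch.testing.assert_close(y, y_ref_dq, atol=0.1, rtol=0.1)
```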
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = ...
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fa085d836a0>}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
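Every compilation failure in this run reduces to the same ValueError: Triton cannot lower fp8e4nv (e4m3) on this GPU's architecture. A common guard, sketched under the assumption that fp8e4nv needs compute capability 8.9 or newer (Ada/Hopper); the threshold is inferred from the error message above, not taken from the FBGEMM sources:

```python
# Sketch: skip fp8 tests on devices whose architecture Triton's fp8e4nv
# lowering does not support. The (8, 9) threshold is an assumption based
# on the "fp8e4nv not supported in this architecture" error above.
import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


@unittest.skipUnless(
    device_supports_fp8e4nv(),
    "Triton fp8e4nv requires a newer GPU architecture",
)
class ActivationFp8Tests(unittest.TestCase):
    # fp8-dependent cases such as test_silu_mul_quant would live here.
    pass
```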
The remaining Hypothesis examples re-ran the identical test body and traceback shape shown above; each one failed in triton/compiler/compiler.py:100 with a CompilationError pointing at the failing kernel's def line:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example (T, D, scale_ub, contiguous, compiled)                failing call                        kernel
T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True    ref_fn() moe/activation_test.py:126  _kernel_quantize_fp8_row (fp8_gemm.py:2370)
T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False   fn()     moe/activation_test.py:117  _fbgemm_silu_mul_quant (moe/activation.py:80)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3773724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3773948Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3774280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3774374Z kernel = self.compile( 2025-05-07T20:32:51.3774793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3774963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3775092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3775096Z 2025-05-07T20:32:51.3775297Z self = 2025-05-07T20:32:51.3776075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3776571Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca2700>} 2025-05-07T20:32:51.3777346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3777540Z context = 2025-05-07T20:32:51.3777544Z 2025-05-07T20:32:51.3777701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3777964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3778064Z module_map=module_map) 2025-05-07T20:32:51.3778223Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3778320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3778387Z E ^ 2025-05-07T20:32:51.3778733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3778747Z 2025-05-07T20:32:51.3779150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3779157Z 2025-05-07T20:32:51.3779250Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3779471Z self=, 2025-05-07T20:32:51.3779540Z T=1, 2025-05-07T20:32:51.3779608Z D=5120, 2025-05-07T20:32:51.3779690Z scale_ub=None, 2025-05-07T20:32:51.3779766Z contiguous=True, 2025-05-07T20:32:51.3779840Z compiled=True, 2025-05-07T20:32:51.3779910Z ) 2025-05-07T20:32:51.3780124Z self = 2025-05-07T20:32:51.3780364Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3780369Z 2025-05-07T20:32:51.3780439Z @given( 2025-05-07T20:32:51.3780551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3780651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3780765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3780873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3780988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3781053Z ) 2025-05-07T20:32:51.3781298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3781383Z def test_silu_mul_quant( 2025-05-07T20:32:51.3781451Z self, 2025-05-07T20:32:51.3781525Z T: int, 2025-05-07T20:32:51.3781592Z D: int, 2025-05-07T20:32:51.3781685Z scale_ub: Optional[float], 2025-05-07T20:32:51.3781774Z contiguous: bool, 2025-05-07T20:32:51.3781858Z compiled: bool, 2025-05-07T20:32:51.3781929Z ) -> None: 2025-05-07T20:32:51.3782025Z torch.manual_seed(2025) 2025-05-07T20:32:51.3782089Z 2025-05-07T20:32:51.3782251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3782327Z 2025-05-07T20:32:51.3782504Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3782624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3782713Z x = x_sign * x_clamp 2025-05-07T20:32:51.3782786Z x0 = x[:, :D] 2025-05-07T20:32:51.3782868Z x1 = x[:, D:] 2025-05-07T20:32:51.3782933Z 2025-05-07T20:32:51.3783010Z if contiguous: 2025-05-07T20:32:51.3783102Z x0 = x0.contiguous() 2025-05-07T20:32:51.3783183Z x1 = x1.contiguous() 2025-05-07T20:32:51.3783251Z 2025-05-07T20:32:51.3783340Z if scale_ub is not None: 2025-05-07T20:32:51.3783440Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3783575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3783651Z ) 2025-05-07T20:32:51.3783718Z else: 2025-05-07T20:32:51.3783805Z scale_ub_tensor = None 2025-05-07T20:32:51.3783877Z 2025-05-07T20:32:51.3783999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3784135Z op = silu_mul_quant 2025-05-07T20:32:51.3784215Z if compiled: 2025-05-07T20:32:51.3784309Z op = torch.compile(op) 2025-05-07T20:32:51.3784415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3784482Z 2025-05-07T20:32:51.3784568Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3784691Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3784754Z 2025-05-07T20:32:51.3784881Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3784983Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3785079Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3785192Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3785333Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3785400Z 2025-05-07T20:32:51.3785499Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.3785506Z 2025-05-07T20:32:51.3785599Z moe/activation_test.py:126: 2025-05-07T20:32:51.3785721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3785824Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.3785952Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3786499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.3786597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.3787030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3787254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3787614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.3787864Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.3788237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.3788402Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.3788768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.3788854Z fn() 2025-05-07T20:32:51.3789250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.3789330Z self.fn.run( 2025-05-07T20:32:51.3789664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3789749Z kernel = self.compile( 2025-05-07T20:32:51.3790127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3790339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3790470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3790474Z 2025-05-07T20:32:51.3790671Z self = 2025-05-07T20:32:51.3791439Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3791945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fca3ba0>} 2025-05-07T20:32:51.3792679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3792913Z context = 2025-05-07T20:32:51.3792918Z 2025-05-07T20:32:51.3793074Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3793331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3793435Z module_map=module_map) 2025-05-07T20:32:51.3793588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3793689Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.3793757Z E ^ 2025-05-07T20:32:51.3794107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3794112Z 2025-05-07T20:32:51.3794521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3794529Z 2025-05-07T20:32:51.3794625Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3794849Z self=, 2025-05-07T20:32:51.3794917Z T=2048, 2025-05-07T20:32:51.3794985Z D=5120, 2025-05-07T20:32:51.3795065Z scale_ub=None, 2025-05-07T20:32:51.3795141Z contiguous=True, 2025-05-07T20:32:51.3795214Z compiled=True, 2025-05-07T20:32:51.3795286Z ) 2025-05-07T20:32:51.3795499Z self = 2025-05-07T20:32:51.3795664Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.3795668Z 2025-05-07T20:32:51.3795746Z @given( 2025-05-07T20:32:51.3796005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3796116Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3796282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3796424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3806374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3806457Z ) 2025-05-07T20:32:51.3806716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3806809Z def test_silu_mul_quant( 2025-05-07T20:32:51.3806882Z self, 2025-05-07T20:32:51.3806964Z T: int, 2025-05-07T20:32:51.3807034Z D: int, 2025-05-07T20:32:51.3807130Z scale_ub: Optional[float], 2025-05-07T20:32:51.3807223Z contiguous: bool, 2025-05-07T20:32:51.3807306Z compiled: bool, 2025-05-07T20:32:51.3807383Z ) -> None: 2025-05-07T20:32:51.3807482Z torch.manual_seed(2025) 2025-05-07T20:32:51.3807561Z 2025-05-07T20:32:51.3807739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3807811Z 2025-05-07T20:32:51.3807901Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3808031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3808117Z x = x_sign * x_clamp 2025-05-07T20:32:51.3808460Z x0 = x[:, :D] 2025-05-07T20:32:51.3808585Z x1 = x[:, D:] 2025-05-07T20:32:51.3808684Z 2025-05-07T20:32:51.3808797Z if contiguous: 2025-05-07T20:32:51.3808896Z x0 = x0.contiguous() 2025-05-07T20:32:51.3808983Z x1 = x1.contiguous() 2025-05-07T20:32:51.3809054Z 2025-05-07T20:32:51.3809149Z if scale_ub is not None: 2025-05-07T20:32:51.3809254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3809389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3809468Z ) 2025-05-07T20:32:51.3809543Z else: 2025-05-07T20:32:51.3809650Z scale_ub_tensor = None 2025-05-07T20:32:51.3809721Z 2025-05-07T20:32:51.3809848Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3809941Z op = silu_mul_quant 2025-05-07T20:32:51.3810025Z if compiled: 2025-05-07T20:32:51.3810230Z op = torch.compile(op) 2025-05-07T20:32:51.3810344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3810416Z 2025-05-07T20:32:51.3810504Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3810628Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3810700Z 2025-05-07T20:32:51.3810830Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3810936Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3811032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3811154Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3811297Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3811366Z 2025-05-07T20:32:51.3811472Z > y_fp8_ref, 
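Note on the root cause: fp8e4nv is Triton's name for the FP8 E4M3 format, which Triton only compiles for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G GPU is sm_86, so every drawn example dies in the same Triton compile step, whether the op runs eagerly or under torch.compile, and whether the first kernel reached is _fbgemm_silu_mul_quant or _kernel_quantize_fp8_row. A minimal sketch of a capability gate that would skip these tests on such hardware (supports_fp8_e4m3 and skip_if_no_fp8_e4m3 are hypothetical helpers, not FBGEMM APIs; the 8.9 threshold is an assumption based on the error above):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv needs an NVIDIA GPU with
        # compute capability >= 8.9 (e.g. L4, L40S, H100).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests such as test_silu_mul_quant.
    skip_if_no_fp8_e4m3 = unittest.skipUnless(
        supports_fp8_e4m3(), "fp8e4nv (E4M3) requires sm_89 or newer"
    )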
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fails at y_fp8, y_scale = fn() (via torch/_dynamo/eval_frame.py) while compiling _fbgemm_silu_mul_quant
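For reference, the computation under test can be written without Triton at all. A minimal eager-mode sketch of the same semantics -- SiLU(x0) * x1 in fp32, then rowwise FP8 quantization with an optional per-row scale upper bound -- assuming torch.float8_e4m3fn is available; silu_mul_quant_ref is a hypothetical stand-in mirroring the test's ref_fn, not FBGEMM's fused kernel:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Rowwise scale: map each row's max magnitude onto the fp8 range,
        # optionally clamped by the scale upper bound.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = (row_max / fp8_max).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantization is then y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does after calling fn().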
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeds; fails in ref_fn() while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fails at y_fp8, y_scale = fn() while compiling _fbgemm_silu_mul_quant
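The error message itself points at a workaround: on this architecture Triton still accepts fp8e5 (E5M2) and fp8e4b15, so a kernel that can fall back to E5M2 would compile here, at the cost of one mantissa bit of precision. A sketch under that assumption (pick_fp8_dtype is a hypothetical helper, not part of FBGEMM):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # E4M3 (Triton fp8e4nv) is assumed to need sm_89+; the log above
        # shows fp8e5 (E5M2) is accepted on this A10G (sm_86).
        if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2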
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3900614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3900837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3901165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3901252Z kernel = self.compile( 2025-05-07T20:32:51.3901625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3901903Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3902022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3902027Z 2025-05-07T20:32:51.3902224Z self = 2025-05-07T20:32:51.3903002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3903504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f4c91c0>} 2025-05-07T20:32:51.3904240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3904470Z context = 2025-05-07T20:32:51.3904475Z 2025-05-07T20:32:51.3904628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3904884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3904984Z module_map=module_map) 2025-05-07T20:32:51.3905139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3905227Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3905298Z E ^ 2025-05-07T20:32:51.3905641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3905645Z 2025-05-07T20:32:51.3906045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3906054Z 2025-05-07T20:32:51.3906148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3906362Z self=, 2025-05-07T20:32:51.3906428Z T=128, 2025-05-07T20:32:51.3906495Z D=5120, 2025-05-07T20:32:51.3906564Z scale_ub=None, 2025-05-07T20:32:51.3906642Z contiguous=False, 2025-05-07T20:32:51.3906718Z compiled=True, 2025-05-07T20:32:51.3906779Z ) 2025-05-07T20:32:51.3906987Z self = 2025-05-07T20:32:51.3907236Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3907240Z 2025-05-07T20:32:51.3907304Z @given( 2025-05-07T20:32:51.3907418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3907505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3907611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3907728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3907831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3907897Z ) 2025-05-07T20:32:51.3908141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3908228Z def test_silu_mul_quant( 2025-05-07T20:32:51.3908494Z self, 2025-05-07T20:32:51.3908604Z T: int, 2025-05-07T20:32:51.3908697Z D: int, 2025-05-07T20:32:51.3908787Z scale_ub: Optional[float], 2025-05-07T20:32:51.3908869Z contiguous: bool, 2025-05-07T20:32:51.3908943Z compiled: bool, 2025-05-07T20:32:51.3909024Z ) -> None: 2025-05-07T20:32:51.3909112Z torch.manual_seed(2025) 2025-05-07T20:32:51.3909176Z 2025-05-07T20:32:51.3909344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3909406Z 2025-05-07T20:32:51.3909489Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3909611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3909772Z x = x_sign * x_clamp 2025-05-07T20:32:51.3909842Z x0 = x[:, :D] 2025-05-07T20:32:51.3909916Z x1 = x[:, D:] 2025-05-07T20:32:51.3909977Z 2025-05-07T20:32:51.3910050Z if contiguous: 2025-05-07T20:32:51.3910136Z x0 = x0.contiguous() 2025-05-07T20:32:51.3910216Z x1 = x1.contiguous() 2025-05-07T20:32:51.3910282Z 2025-05-07T20:32:51.3910362Z if scale_ub is not None: 2025-05-07T20:32:51.3910457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3910589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3910658Z ) 2025-05-07T20:32:51.3910722Z else: 2025-05-07T20:32:51.3910814Z scale_ub_tensor = None 2025-05-07T20:32:51.3910877Z 2025-05-07T20:32:51.3910999Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3911176Z op = silu_mul_quant 2025-05-07T20:32:51.3911257Z if compiled: 2025-05-07T20:32:51.3911348Z op = torch.compile(op) 2025-05-07T20:32:51.3911450Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3911513Z 2025-05-07T20:32:51.3911601Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3911606Z 2025-05-07T20:32:51.3911694Z moe/activation_test.py:117: 2025-05-07T20:32:51.3911817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3911914Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3912006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3912369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3912457Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3912938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3913032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3913386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3913601Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3913935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3914019Z kernel = self.compile( 2025-05-07T20:32:51.3914389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3914678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3914799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3914804Z 2025-05-07T20:32:51.3915011Z self = 2025-05-07T20:32:51.3915786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3916328Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07eb8b240>} 2025-05-07T20:32:51.3917068Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3917257Z context = 2025-05-07T20:32:51.3917262Z 2025-05-07T20:32:51.3917420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3917674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3917838Z module_map=module_map) 2025-05-07T20:32:51.3917990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3918078Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3918153Z E ^ 2025-05-07T20:32:51.3918496Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3918501Z 2025-05-07T20:32:51.3918903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3918908Z 2025-05-07T20:32:51.3919007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3919226Z self=, 2025-05-07T20:32:51.3919298Z T=128, 2025-05-07T20:32:51.3919363Z D=7168, 2025-05-07T20:32:51.3919434Z scale_ub=1200.0, 2025-05-07T20:32:51.3919516Z contiguous=False, 2025-05-07T20:32:51.3919636Z compiled=False, 2025-05-07T20:32:51.3919704Z ) 2025-05-07T20:32:51.3919916Z self = 2025-05-07T20:32:51.3920077Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.3920082Z 2025-05-07T20:32:51.3920146Z @given( 2025-05-07T20:32:51.3920261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3920349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3920460Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3920567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3920674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3920743Z ) 2025-05-07T20:32:51.3920980Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3921065Z def test_silu_mul_quant( 2025-05-07T20:32:51.3921136Z self, 2025-05-07T20:32:51.3921201Z T: int, 2025-05-07T20:32:51.3921268Z D: int, 2025-05-07T20:32:51.3921362Z scale_ub: Optional[float], 2025-05-07T20:32:51.3921440Z contiguous: bool, 2025-05-07T20:32:51.3921515Z compiled: bool, 2025-05-07T20:32:51.3921589Z ) -> None: 2025-05-07T20:32:51.3921673Z torch.manual_seed(2025) 2025-05-07T20:32:51.3921739Z 2025-05-07T20:32:51.3921897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3921960Z 2025-05-07T20:32:51.3922048Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3925528Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3925631Z x = x_sign * x_clamp 2025-05-07T20:32:51.3925818Z x0 = x[:, :D] 2025-05-07T20:32:51.3925892Z x1 = x[:, D:] 2025-05-07T20:32:51.3925962Z 2025-05-07T20:32:51.3926046Z if contiguous: 2025-05-07T20:32:51.3926140Z x0 = x0.contiguous() 2025-05-07T20:32:51.3926223Z x1 = x1.contiguous() 2025-05-07T20:32:51.3926297Z 2025-05-07T20:32:51.3926386Z if scale_ub is not None: 2025-05-07T20:32:51.3926490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3926628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3926700Z ) 2025-05-07T20:32:51.3926774Z else: 2025-05-07T20:32:51.3926864Z scale_ub_tensor = None 2025-05-07T20:32:51.3926933Z 2025-05-07T20:32:51.3927075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3927160Z op = silu_mul_quant 2025-05-07T20:32:51.3927238Z if compiled: 2025-05-07T20:32:51.3927335Z op = torch.compile(op) 2025-05-07T20:32:51.3927441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3927510Z 2025-05-07T20:32:51.3927602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3927607Z 2025-05-07T20:32:51.3927700Z moe/activation_test.py:117: 2025-05-07T20:32:51.3927823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3927974Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3928071Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3928565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3928657Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3929008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3929224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3929560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3929651Z kernel = self.compile( 2025-05-07T20:32:51.3930027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3930244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3930373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3930378Z 2025-05-07T20:32:51.3930575Z self = 2025-05-07T20:32:51.3931347Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3931847Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07eb89080>} 2025-05-07T20:32:51.3932582Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3932773Z context = 2025-05-07T20:32:51.3932777Z 2025-05-07T20:32:51.3932933Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3933197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3933299Z module_map=module_map) 2025-05-07T20:32:51.3933455Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3933556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3933628Z E ^ 2025-05-07T20:32:51.3934054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3934066Z 2025-05-07T20:32:51.3934475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3934480Z 2025-05-07T20:32:51.3934578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3934799Z self=, 2025-05-07T20:32:51.3934870Z T=128, 2025-05-07T20:32:51.3934940Z D=5120, 2025-05-07T20:32:51.3935018Z scale_ub=None, 2025-05-07T20:32:51.3935101Z contiguous=False, 2025-05-07T20:32:51.3935180Z compiled=False, 2025-05-07T20:32:51.3935249Z ) 2025-05-07T20:32:51.3935458Z self = 2025-05-07T20:32:51.3935625Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.3935630Z 2025-05-07T20:32:51.3935702Z @given( 2025-05-07T20:32:51.3935819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3935914Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3936023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3936132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3936252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3936366Z ) 2025-05-07T20:32:51.3936603Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3936701Z def test_silu_mul_quant( 2025-05-07T20:32:51.3936774Z self, 2025-05-07T20:32:51.3936855Z T: int, 2025-05-07T20:32:51.3936924Z D: int, 2025-05-07T20:32:51.3937018Z scale_ub: Optional[float], 2025-05-07T20:32:51.3937102Z contiguous: bool, 2025-05-07T20:32:51.3937181Z compiled: bool, 2025-05-07T20:32:51.3937253Z ) -> None: 2025-05-07T20:32:51.3937345Z torch.manual_seed(2025) 2025-05-07T20:32:51.3937412Z 2025-05-07T20:32:51.3937581Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3937658Z 2025-05-07T20:32:51.3937750Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3937871Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3937961Z x = x_sign * x_clamp 2025-05-07T20:32:51.3938084Z x0 = x[:, :D] 2025-05-07T20:32:51.3938165Z x1 = x[:, D:] 2025-05-07T20:32:51.3938232Z 2025-05-07T20:32:51.3938311Z if contiguous: 2025-05-07T20:32:51.3938397Z x0 = x0.contiguous() 2025-05-07T20:32:51.3938482Z x1 = x1.contiguous() 2025-05-07T20:32:51.3938549Z 2025-05-07T20:32:51.3938639Z if scale_ub is not None: 2025-05-07T20:32:51.3938739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3938871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3938944Z ) 2025-05-07T20:32:51.3939017Z else: 2025-05-07T20:32:51.3939114Z scale_ub_tensor = None 2025-05-07T20:32:51.3939192Z 2025-05-07T20:32:51.3939317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3939403Z op = silu_mul_quant 2025-05-07T20:32:51.3939492Z if compiled: 2025-05-07T20:32:51.3939589Z op = torch.compile(op) 2025-05-07T20:32:51.3939707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3939777Z 2025-05-07T20:32:51.3939867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3939871Z 2025-05-07T20:32:51.3939971Z moe/activation_test.py:117: 2025-05-07T20:32:51.3940091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3940188Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3940283Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3940771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3940947Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3941298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3941514Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3941851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3941945Z kernel = self.compile( 2025-05-07T20:32:51.3942318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3942491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3942613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3942617Z 2025-05-07T20:32:51.3942816Z self = 2025-05-07T20:32:51.3943589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3944084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e82c9a0>} 2025-05-07T20:32:51.3944889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3945073Z context = 2025-05-07T20:32:51.3945078Z 2025-05-07T20:32:51.3945242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3945504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3945615Z module_map=module_map) 2025-05-07T20:32:51.3945774Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3945869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3945941Z E ^ 2025-05-07T20:32:51.3946284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3946332Z 2025-05-07T20:32:51.3946742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3946747Z 2025-05-07T20:32:51.3946843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3947062Z self=, 2025-05-07T20:32:51.3947133Z T=128, 2025-05-07T20:32:51.3947203Z D=5120, 2025-05-07T20:32:51.3947281Z scale_ub=1200.0, 2025-05-07T20:32:51.3947360Z contiguous=True, 2025-05-07T20:32:51.3947441Z compiled=False, 2025-05-07T20:32:51.3947524Z ) 2025-05-07T20:32:51.3947737Z self = 2025-05-07T20:32:51.3947901Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3947906Z 2025-05-07T20:32:51.3947990Z @given( 2025-05-07T20:32:51.3948118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3948233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3948353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3948463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3948571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3948641Z ) 2025-05-07T20:32:51.3948880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3948975Z def test_silu_mul_quant( 2025-05-07T20:32:51.3949046Z self, 2025-05-07T20:32:51.3949119Z T: int, 2025-05-07T20:32:51.3949273Z D: int, 2025-05-07T20:32:51.3949366Z scale_ub: Optional[float], 2025-05-07T20:32:51.3949449Z contiguous: bool, 2025-05-07T20:32:51.3949529Z compiled: bool, 2025-05-07T20:32:51.3949603Z ) -> None: 2025-05-07T20:32:51.3949691Z torch.manual_seed(2025) 2025-05-07T20:32:51.3949769Z 2025-05-07T20:32:51.3949934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3950009Z 2025-05-07T20:32:51.3950096Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3950214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3950301Z x = x_sign * x_clamp 2025-05-07T20:32:51.3950378Z x0 = x[:, :D] 2025-05-07T20:32:51.3950457Z x1 = x[:, D:] 2025-05-07T20:32:51.3950533Z 2025-05-07T20:32:51.3950612Z if contiguous: 2025-05-07T20:32:51.3950699Z x0 = x0.contiguous() 2025-05-07T20:32:51.3950784Z x1 = x1.contiguous() 2025-05-07T20:32:51.3950852Z 2025-05-07T20:32:51.3950940Z if scale_ub is not None: 2025-05-07T20:32:51.3951046Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3951175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3951255Z ) 2025-05-07T20:32:51.3951326Z else: 2025-05-07T20:32:51.3951417Z scale_ub_tensor = None 2025-05-07T20:32:51.3951531Z 2025-05-07T20:32:51.3951655Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3951741Z op = silu_mul_quant 2025-05-07T20:32:51.3951824Z if compiled: 2025-05-07T20:32:51.3951919Z op = torch.compile(op) 2025-05-07T20:32:51.3952020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3952090Z 2025-05-07T20:32:51.3952174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3952178Z 2025-05-07T20:32:51.3952268Z moe/activation_test.py:117: 2025-05-07T20:32:51.3952399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3952502Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3952601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3953088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3953220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3953575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3953788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3954123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3954210Z kernel = self.compile( 2025-05-07T20:32:51.3954583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3954759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3954879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3954883Z 2025-05-07T20:32:51.3955083Z self = 2025-05-07T20:32:51.3955861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3956358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa08437b2e0>} 2025-05-07T20:32:51.3957094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3957359Z context = 2025-05-07T20:32:51.3957364Z 2025-05-07T20:32:51.3957524Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3957779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3957884Z module_map=module_map) 2025-05-07T20:32:51.3958044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3958141Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3958211Z E ^ 2025-05-07T20:32:51.3958563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3958568Z 2025-05-07T20:32:51.3958972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3958976Z 2025-05-07T20:32:51.3959074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3959296Z self=, 2025-05-07T20:32:51.3959365Z T=1, 2025-05-07T20:32:51.3959437Z D=7168, 2025-05-07T20:32:51.3959513Z scale_ub=1200.0, 2025-05-07T20:32:51.3959592Z contiguous=True, 2025-05-07T20:32:51.3959671Z compiled=True, 2025-05-07T20:32:51.3959781Z ) 2025-05-07T20:32:51.3960001Z self = 2025-05-07T20:32:51.3960159Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.3960164Z 2025-05-07T20:32:51.3960234Z @given( 2025-05-07T20:32:51.3960348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3960441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3960553Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3960669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3960779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3960848Z ) 2025-05-07T20:32:51.3961087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3961172Z def test_silu_mul_quant( 2025-05-07T20:32:51.3961248Z self, 2025-05-07T20:32:51.3961366Z T: int, 2025-05-07T20:32:51.3961438Z D: int, 2025-05-07T20:32:51.3961534Z scale_ub: Optional[float], 2025-05-07T20:32:51.3961616Z contiguous: bool, 2025-05-07T20:32:51.3961695Z compiled: bool, 2025-05-07T20:32:51.3961771Z ) -> None: 2025-05-07T20:32:51.3961859Z torch.manual_seed(2025) 2025-05-07T20:32:51.3961926Z 2025-05-07T20:32:51.3962089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3962155Z 2025-05-07T20:32:51.3962239Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3962360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3962442Z x = x_sign * x_clamp 2025-05-07T20:32:51.3962527Z x0 = x[:, :D] 2025-05-07T20:32:51.3962602Z x1 = x[:, D:] 2025-05-07T20:32:51.3962669Z 2025-05-07T20:32:51.3962750Z if contiguous: 2025-05-07T20:32:51.3962838Z x0 = x0.contiguous() 2025-05-07T20:32:51.3962922Z x1 = x1.contiguous() 2025-05-07T20:32:51.3962997Z 2025-05-07T20:32:51.3963082Z if scale_ub is not None: 2025-05-07T20:32:51.3963185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3963320Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3963391Z ) 2025-05-07T20:32:51.3963463Z else: 2025-05-07T20:32:51.3963552Z scale_ub_tensor = None 2025-05-07T20:32:51.3963617Z 2025-05-07T20:32:51.3963739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3963824Z op = silu_mul_quant 2025-05-07T20:32:51.3963905Z if compiled: 2025-05-07T20:32:51.3964004Z op = torch.compile(op) 2025-05-07T20:32:51.3964189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3964350Z 2025-05-07T20:32:51.3964438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3964442Z 2025-05-07T20:32:51.3964529Z moe/activation_test.py:117: 2025-05-07T20:32:51.3964651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3964755Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3964843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3965201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3965285Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3965764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3965858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3966212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3966422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3966750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3966836Z kernel = self.compile( 2025-05-07T20:32:51.3967260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3967423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3967542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3967546Z 2025-05-07T20:32:51.3967743Z self = 2025-05-07T20:32:51.3968519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3969019Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f586f20>} 2025-05-07T20:32:51.3969790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3969973Z context = 2025-05-07T20:32:51.3969977Z 2025-05-07T20:32:51.3970133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3970385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3971919Z module_map=module_map) 2025-05-07T20:32:51.3972072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3972159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3972231Z E ^ 2025-05-07T20:32:51.3972574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3972582Z 2025-05-07T20:32:51.3972986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3972993Z 2025-05-07T20:32:51.3973084Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3973298Z self=, 2025-05-07T20:32:51.3973365Z T=1, 2025-05-07T20:32:51.3973429Z D=7168, 2025-05-07T20:32:51.3973500Z scale_ub=1200.0, 2025-05-07T20:32:51.3973579Z contiguous=False, 2025-05-07T20:32:51.3973652Z compiled=True, 2025-05-07T20:32:51.3973713Z ) 2025-05-07T20:32:51.3973923Z self = 2025-05-07T20:32:51.3974181Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.3974185Z 2025-05-07T20:32:51.3974253Z @given( 2025-05-07T20:32:51.3974360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3974449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3974564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3974670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3974772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3974841Z ) 2025-05-07T20:32:51.3975075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3975156Z def test_silu_mul_quant( 2025-05-07T20:32:51.3975221Z self, 2025-05-07T20:32:51.3975289Z T: int, 2025-05-07T20:32:51.3975356Z D: int, 2025-05-07T20:32:51.3975447Z scale_ub: Optional[float], 2025-05-07T20:32:51.3975525Z contiguous: bool, 2025-05-07T20:32:51.3975613Z compiled: bool, 2025-05-07T20:32:51.3975679Z ) -> None: 2025-05-07T20:32:51.3975763Z torch.manual_seed(2025) 2025-05-07T20:32:51.3975828Z 2025-05-07T20:32:51.3975990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3976056Z 2025-05-07T20:32:51.3976188Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3976302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3976380Z x = x_sign * x_clamp 2025-05-07T20:32:51.3976451Z x0 = x[:, :D] 2025-05-07T20:32:51.3976519Z x1 = x[:, D:] 2025-05-07T20:32:51.3976581Z 2025-05-07T20:32:51.3976658Z if contiguous: 2025-05-07T20:32:51.3976738Z x0 = x0.contiguous() 2025-05-07T20:32:51.3976825Z x1 = x1.contiguous() 2025-05-07T20:32:51.3976889Z 2025-05-07T20:32:51.3976974Z if scale_ub is not None: 2025-05-07T20:32:51.3977071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3977204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3977268Z ) 2025-05-07T20:32:51.3977336Z else: 2025-05-07T20:32:51.3977417Z scale_ub_tensor = None 2025-05-07T20:32:51.3977480Z 2025-05-07T20:32:51.3977601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3977726Z op = silu_mul_quant 2025-05-07T20:32:51.3977800Z if compiled: 2025-05-07T20:32:51.3977890Z op = torch.compile(op) 2025-05-07T20:32:51.3977986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3978048Z 2025-05-07T20:32:51.3978128Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3978132Z 2025-05-07T20:32:51.3978220Z moe/activation_test.py:117: 2025-05-07T20:32:51.3978349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3978441Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3978530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3978893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.3978976Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.3979462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3979551Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3979897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3980115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3980442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3980524Z kernel = self.compile( 2025-05-07T20:32:51.3980903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3981145Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3981269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3981273Z 2025-05-07T20:32:51.3981467Z self = 2025-05-07T20:32:51.3982237Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3982732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f5872e0>} 2025-05-07T20:32:51.3983465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3983650Z context = 2025-05-07T20:32:51.3983654Z 2025-05-07T20:32:51.3983806Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3984062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3984204Z module_map=module_map) 2025-05-07T20:32:51.3984355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3984450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3984514Z E ^ 2025-05-07T20:32:51.3984859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3984863Z 2025-05-07T20:32:51.3985273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3985277Z 2025-05-07T20:32:51.3985375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3985586Z self=, 2025-05-07T20:32:51.3985662Z T=1, 2025-05-07T20:32:51.3985736Z D=7168, 2025-05-07T20:32:51.3985823Z scale_ub=None, 2025-05-07T20:32:51.3985958Z contiguous=False, 2025-05-07T20:32:51.3986044Z compiled=True, 2025-05-07T20:32:51.3986106Z ) 2025-05-07T20:32:51.3986319Z self = 2025-05-07T20:32:51.3986473Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.3986478Z 2025-05-07T20:32:51.3986543Z @given( 2025-05-07T20:32:51.3986654Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3986742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3986850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3986955Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3987061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3987127Z ) 2025-05-07T20:32:51.3987363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3987452Z def test_silu_mul_quant( 2025-05-07T20:32:51.3987520Z self, 2025-05-07T20:32:51.3987587Z T: int, 2025-05-07T20:32:51.3987659Z D: int, 2025-05-07T20:32:51.3987746Z scale_ub: Optional[float], 2025-05-07T20:32:51.3987825Z contiguous: bool, 2025-05-07T20:32:51.3987906Z compiled: bool, 2025-05-07T20:32:51.3987974Z ) -> None: 2025-05-07T20:32:51.3988061Z torch.manual_seed(2025) 2025-05-07T20:32:51.3988127Z 2025-05-07T20:32:51.3988288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3988349Z 2025-05-07T20:32:51.3988435Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3988549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3988711Z x = x_sign * x_clamp 2025-05-07T20:32:51.3988788Z x0 = x[:, :D] 2025-05-07T20:32:51.3988858Z x1 = x[:, D:] 2025-05-07T20:32:51.3988922Z 2025-05-07T20:32:51.3988996Z if contiguous: 2025-05-07T20:32:51.3989079Z x0 = x0.contiguous() 2025-05-07T20:32:51.3989162Z x1 = x1.contiguous() 2025-05-07T20:32:51.3989225Z 2025-05-07T20:32:51.3989303Z if scale_ub is not None: 2025-05-07T20:32:51.3989404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3989534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3989598Z ) 2025-05-07T20:32:51.3989666Z else: 2025-05-07T20:32:51.3989749Z scale_ub_tensor = None 2025-05-07T20:32:51.3989810Z 2025-05-07T20:32:51.3989932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3990011Z op = silu_mul_quant 2025-05-07T20:32:51.3990087Z if compiled: 2025-05-07T20:32:51.3990185Z op = torch.compile(op) 2025-05-07T20:32:51.3990281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3990346Z 2025-05-07T20:32:51.3990426Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.3990536Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.3990606Z 2025-05-07T20:32:51.3990777Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3990867Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.3990963Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.3991073Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.3991205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3991274Z 2025-05-07T20:32:51.3991362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:51.3991367Z 2025-05-07T20:32:51.3991462Z moe/activation_test.py:126: 2025-05-07T20:32:51.3991587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3991683Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.3991811Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.3992356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.3992523Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.3992872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3993084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3993440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.3993688Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.3994055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.3994214Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.3994542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.3994613Z fn() 2025-05-07T20:32:51.3995003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.3995074Z self.fn.run( 2025-05-07T20:32:51.3995401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3995482Z kernel = self.compile( 2025-05-07T20:32:51.3995850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3996016Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3996218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3996223Z 2025-05-07T20:32:51.3996422Z self = 2025-05-07T20:32:51.3997186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3997685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f587380>} 2025-05-07T20:32:51.3998417Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3998602Z context = 2025-05-07T20:32:51.3998606Z 2025-05-07T20:32:51.3998765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3999019Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3999122Z module_map=module_map) 2025-05-07T20:32:51.3999275Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3999408Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.3999476Z E ^ 2025-05-07T20:32:51.3999822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3999826Z 2025-05-07T20:32:51.4000233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4000237Z 2025-05-07T20:32:51.4000333Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4000544Z self=, 2025-05-07T20:32:51.4000618Z T=1, 2025-05-07T20:32:51.4000681Z D=5120, 2025-05-07T20:32:51.4000753Z scale_ub=1200.0, 2025-05-07T20:32:51.4000831Z contiguous=False, 2025-05-07T20:32:51.4000905Z compiled=True, 2025-05-07T20:32:51.4000966Z ) 2025-05-07T20:32:51.4001226Z self = 2025-05-07T20:32:51.4001389Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4001394Z 2025-05-07T20:32:51.4001457Z @given( 2025-05-07T20:32:51.4001570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4001659Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4001768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4001874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4001977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4002042Z ) 2025-05-07T20:32:51.4002308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4002396Z def test_silu_mul_quant( 2025-05-07T20:32:51.4002477Z self, 2025-05-07T20:32:51.4002543Z T: int, 2025-05-07T20:32:51.4002609Z D: int, 2025-05-07T20:32:51.4002699Z scale_ub: Optional[float], 2025-05-07T20:32:51.4002783Z contiguous: bool, 2025-05-07T20:32:51.4002856Z compiled: bool, 2025-05-07T20:32:51.4002924Z ) -> None: 2025-05-07T20:32:51.4003006Z torch.manual_seed(2025) 2025-05-07T20:32:51.4003068Z 2025-05-07T20:32:51.4003226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4003288Z 2025-05-07T20:32:51.4003371Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4003484Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4003561Z x = x_sign * x_clamp 2025-05-07T20:32:51.4003635Z x0 = x[:, :D] 2025-05-07T20:32:51.4003704Z x1 = x[:, D:] 2025-05-07T20:32:51.4003872Z 2025-05-07T20:32:51.4003950Z if contiguous: 2025-05-07T20:32:51.4004031Z x0 = x0.contiguous() 2025-05-07T20:32:51.4004109Z x1 = x1.contiguous() 2025-05-07T20:32:51.4004173Z 2025-05-07T20:32:51.4004319Z if scale_ub is not None: 2025-05-07T20:32:51.4004421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4004548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4004612Z ) 2025-05-07T20:32:51.4004683Z else: 2025-05-07T20:32:51.4004765Z scale_ub_tensor = None 2025-05-07T20:32:51.4004825Z 2025-05-07T20:32:51.4004944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4005022Z op = silu_mul_quant 2025-05-07T20:32:51.4005095Z if compiled: 2025-05-07T20:32:51.4005186Z op = torch.compile(op) 2025-05-07T20:32:51.4005280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4005339Z 2025-05-07T20:32:51.4005426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4005430Z 2025-05-07T20:32:51.4005515Z moe/activation_test.py:117: 2025-05-07T20:32:51.4005637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4005729Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4005820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4006228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4006310Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4006796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4006884Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4007229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4007452Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4007782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4007867Z kernel = self.compile( 2025-05-07T20:32:51.4008517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4008866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4009000Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4009011Z 2025-05-07T20:32:51.4009207Z self = 2025-05-07T20:32:51.4009970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4010472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07fe33f60>} 2025-05-07T20:32:51.4011202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4011393Z context = 2025-05-07T20:32:51.4011397Z 2025-05-07T20:32:51.4011553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4011807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4011911Z module_map=module_map) 2025-05-07T20:32:51.4012065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4012157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4012368Z E ^ 2025-05-07T20:32:51.4012713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4012718Z 2025-05-07T20:32:51.4013120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4013130Z 2025-05-07T20:32:51.4013226Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4013437Z self=, 2025-05-07T20:32:51.4013503Z T=1, 2025-05-07T20:32:51.4013569Z D=5120, 2025-05-07T20:32:51.4013642Z scale_ub=1200.0, 2025-05-07T20:32:51.4013716Z contiguous=False, 2025-05-07T20:32:51.4013794Z compiled=False, 2025-05-07T20:32:51.4013864Z ) 2025-05-07T20:32:51.4014071Z self = 2025-05-07T20:32:51.4014230Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4014239Z 2025-05-07T20:32:51.4014307Z @given( 2025-05-07T20:32:51.4014414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4014503Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4014611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4014720Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4014892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4014954Z ) 2025-05-07T20:32:51.4015188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4015278Z def test_silu_mul_quant( 2025-05-07T20:32:51.4015342Z self, 2025-05-07T20:32:51.4015409Z T: int, 2025-05-07T20:32:51.4015476Z D: int, 2025-05-07T20:32:51.4015562Z scale_ub: Optional[float], 2025-05-07T20:32:51.4015642Z contiguous: bool, 2025-05-07T20:32:51.4015721Z compiled: bool, 2025-05-07T20:32:51.4015786Z ) -> None: 2025-05-07T20:32:51.4015876Z torch.manual_seed(2025) 2025-05-07T20:32:51.4015943Z 2025-05-07T20:32:51.4016104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4016171Z 2025-05-07T20:32:51.4016250Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4016412Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4016504Z x = x_sign * x_clamp 2025-05-07T20:32:51.4016572Z x0 = x[:, :D] 2025-05-07T20:32:51.4016641Z x1 = x[:, D:] 2025-05-07T20:32:51.4016704Z 2025-05-07T20:32:51.4016779Z if contiguous: 2025-05-07T20:32:51.4016861Z x0 = x0.contiguous() 2025-05-07T20:32:51.4016944Z x1 = x1.contiguous() 2025-05-07T20:32:51.4017005Z 2025-05-07T20:32:51.4017087Z if scale_ub is not None: 2025-05-07T20:32:51.4017186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4017311Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4017387Z ) 2025-05-07T20:32:51.4017455Z else: 2025-05-07T20:32:51.4017539Z scale_ub_tensor = None 2025-05-07T20:32:51.4017605Z 2025-05-07T20:32:51.4017725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4017806Z op = silu_mul_quant 2025-05-07T20:32:51.4017891Z if compiled: 2025-05-07T20:32:51.4017983Z op = torch.compile(op) 2025-05-07T20:32:51.4018080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4018148Z 2025-05-07T20:32:51.4018227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4018231Z 2025-05-07T20:32:51.4018320Z moe/activation_test.py:117: 2025-05-07T20:32:51.4018441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4018532Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4018629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4019197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4019288Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4019637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4019852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4020185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4020269Z kernel = self.compile( 2025-05-07T20:32:51.4020639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4020813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4020933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4020937Z 2025-05-07T20:32:51.4021136Z self = 2025-05-07T20:32:51.4021903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4022443Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07f00f2e0>} 2025-05-07T20:32:51.4023176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4023358Z context = 2025-05-07T20:32:51.4023363Z 2025-05-07T20:32:51.4023519Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4023777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4023874Z module_map=module_map) 2025-05-07T20:32:51.4024031Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4024118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4024229Z E ^ 2025-05-07T20:32:51.4024573Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:51.4024981Z 
2025-05-07T20:32:51.4025073Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.4025290Z     self=,
2025-05-07T20:32:51.4025356Z     T=16384,
2025-05-07T20:32:51.4025422Z     D=5120,
2025-05-07T20:32:51.4025492Z     scale_ub=1200.0,
2025-05-07T20:32:51.4025573Z     contiguous=False,
2025-05-07T20:32:51.4025649Z     compiled=True,
2025-05-07T20:32:51.4025709Z )
2025-05-07T20:32:51.4025916Z self = 
2025-05-07T20:32:51.4026090Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:32:51.4026098Z 
2025-05-07T20:32:51.4026162Z     @given(
2025-05-07T20:32:51.4026276Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.4026363Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.4026468Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.4026578Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.4026680Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.4026742Z     )
2025-05-07T20:32:51.4027018Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.4027115Z     def test_silu_mul_quant(
2025-05-07T20:32:51.4027341Z         self,
2025-05-07T20:32:51.4027413Z         T: int,
2025-05-07T20:32:51.4027477Z         D: int,
2025-05-07T20:32:51.4027568Z         scale_ub: Optional[float],
2025-05-07T20:32:51.4027646Z         contiguous: bool,
2025-05-07T20:32:51.4027719Z         compiled: bool,
2025-05-07T20:32:51.4027791Z     ) -> None:
2025-05-07T20:32:51.4027880Z         torch.manual_seed(2025)
2025-05-07T20:32:51.4027943Z 
2025-05-07T20:32:51.4028106Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.4028168Z 
2025-05-07T20:32:51.4028250Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.4028365Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.4028445Z         x = x_sign * x_clamp
2025-05-07T20:32:51.4028519Z         x0 = x[:, :D]
2025-05-07T20:32:51.4028586Z         x1 = x[:, D:]
2025-05-07T20:32:51.4028648Z 
2025-05-07T20:32:51.4028722Z         if contiguous:
2025-05-07T20:32:51.4028803Z             x0 = x0.contiguous()
2025-05-07T20:32:51.4028889Z             x1 = x1.contiguous()
2025-05-07T20:32:51.4028952Z 
2025-05-07T20:32:51.4029031Z         if scale_ub is not None:
2025-05-07T20:32:51.4029125Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:51.4029254Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:51.4029371Z             )
2025-05-07T20:32:51.4029436Z         else:
2025-05-07T20:32:51.4029527Z             scale_ub_tensor = None
2025-05-07T20:32:51.4029591Z 
2025-05-07T20:32:51.4029709Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:51.4029789Z             op = silu_mul_quant
2025-05-07T20:32:51.4029865Z             if compiled:
2025-05-07T20:32:51.4029963Z                 op = torch.compile(op)
2025-05-07T20:32:51.4030058Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.4030119Z 
2025-05-07T20:32:51.4030208Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:51.4030212Z 
2025-05-07T20:32:51.4030301Z moe/activation_test.py:117: 
2025-05-07T20:32:51.4030431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:51.4030533Z moe/activation_test.py:115: in fn
2025-05-07T20:32:51.4030621Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.4030977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:51.4031141Z     return fn(*args, **kwargs)
2025-05-07T20:32:51.4031619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.4031707Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.4032054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.4032270Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.4032605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.4032689Z     kernel = self.compile(
2025-05-07T20:32:51.4033065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.4033229Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.4033353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:51.4033357Z 
2025-05-07T20:32:51.4033561Z self = 
2025-05-07T20:32:51.4034326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.4034906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e6ba660>}
2025-05-07T20:32:51.4035638Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.4035822Z context = 
2025-05-07T20:32:51.4035831Z 
2025-05-07T20:32:51.4035988Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.4036239Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.4036344Z                            module_map=module_map)
2025-05-07T20:32:51.4036494Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.4036582Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.4036654Z E       ^
2025-05-07T20:32:51.4037004Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.4037008Z 
2025-05-07T20:32:51.4037411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
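[Editor's note: for orientation, silu_mul_quant fuses a SwiGLU-style activation, silu(x0) * x1, with quantization to FP8, returning the quantized tensor plus its scale, which is what y_fp8, y_scale = fn() unpacks above. The eager PyTorch reference below is a sketch under the assumption of per-tensor scaling clamped by scale_ub; the constant FP8_E4M3_MAX (448.0, torch.float8_e4m3fn's finfo max) and the function silu_mul_quant_ref are illustrative, not FBGEMM's kernel.]

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gate: silu(x0) * x1, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-tensor scale from the absolute max, optionally clamped by scale_ub.
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.squeeze())
        scale = torch.clamp(amax, min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale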
2025-05-07T20:32:51.4037416Z 
[Editor's note: Hypothesis then retried test_silu_mul_quant with ten more sampled parameter sets, and every one failed at Triton compile time with the identical CompilationError shown above. The duplicated test source and tracebacks are elided here; only the parameters tried are kept:]
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[Editor's note: the last retry below is cut off mid-traceback in the log.]
2025-05-07T20:32:51.4170030Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.4170250Z     self=,
2025-05-07T20:32:51.4170331Z     T=1,
2025-05-07T20:32:51.4170406Z     D=7168,
2025-05-07T20:32:51.4170486Z     scale_ub=None,
2025-05-07T20:32:51.4170580Z     contiguous=False,
2025-05-07T20:32:51.4170663Z     compiled=False,
2025-05-07T20:32:51.4170743Z )
2025-05-07T20:32:51.4182224Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.4182320Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.4182398Z E       ^
2025-05-07T20:32:51.4182748Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4182753Z 2025-05-07T20:32:51.4183159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4183164Z 2025-05-07T20:32:51.4183267Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4183580Z self=, 2025-05-07T20:32:51.4183662Z T=2048, 2025-05-07T20:32:51.4183739Z D=7168, 2025-05-07T20:32:51.4183825Z scale_ub=None, 2025-05-07T20:32:51.4183913Z contiguous=False, 2025-05-07T20:32:51.4184002Z compiled=True, 2025-05-07T20:32:51.4184075Z ) 2025-05-07T20:32:51.4184294Z self = 2025-05-07T20:32:51.4184464Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4184469Z 2025-05-07T20:32:51.4184543Z @given( 2025-05-07T20:32:51.4184659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4184756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4184874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4184991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4185110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4185190Z ) 2025-05-07T20:32:51.4185433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4185528Z def test_silu_mul_quant( 2025-05-07T20:32:51.4185606Z self, 2025-05-07T20:32:51.4185685Z T: int, 2025-05-07T20:32:51.4185826Z D: int, 2025-05-07T20:32:51.4185926Z scale_ub: Optional[float], 2025-05-07T20:32:51.4186015Z contiguous: bool, 2025-05-07T20:32:51.4186099Z compiled: bool, 2025-05-07T20:32:51.4186181Z ) -> None: 2025-05-07T20:32:51.4186277Z torch.manual_seed(2025) 2025-05-07T20:32:51.4186354Z 2025-05-07T20:32:51.4186519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4186591Z 2025-05-07T20:32:51.4186684Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4186806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4186893Z x = x_sign * x_clamp 2025-05-07T20:32:51.4186979Z x0 = x[:, :D] 2025-05-07T20:32:51.4187056Z x1 = x[:, D:] 2025-05-07T20:32:51.4187130Z 2025-05-07T20:32:51.4187215Z if contiguous: 2025-05-07T20:32:51.4187305Z x0 = x0.contiguous() 2025-05-07T20:32:51.4187395Z x1 = x1.contiguous() 2025-05-07T20:32:51.4187523Z 2025-05-07T20:32:51.4187616Z if scale_ub is not None: 2025-05-07T20:32:51.4187719Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4187855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4187932Z ) 2025-05-07T20:32:51.4188011Z else: 2025-05-07T20:32:51.4188105Z scale_ub_tensor = None 2025-05-07T20:32:51.4188176Z 2025-05-07T20:32:51.4188304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4188393Z op = silu_mul_quant 2025-05-07T20:32:51.4188475Z if compiled: 2025-05-07T20:32:51.4188577Z op = torch.compile(op) 2025-05-07T20:32:51.4188683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4188756Z 2025-05-07T20:32:51.4188848Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4188853Z 2025-05-07T20:32:51.4188947Z moe/activation_test.py:117: 2025-05-07T20:32:51.4189077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4189183Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4189282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4189646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4189738Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4190225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4190327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4190764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4190991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4191327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4191422Z kernel = self.compile( 2025-05-07T20:32:51.4191804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4191976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4192101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4192108Z 2025-05-07T20:32:51.4192308Z self = 2025-05-07T20:32:51.4193087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4193591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f931cef20>} 2025-05-07T20:32:51.4194331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4194566Z context = 2025-05-07T20:32:51.4194570Z 2025-05-07T20:32:51.4194731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4194990Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4195100Z module_map=module_map) 2025-05-07T20:32:51.4195274Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4195374Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4195448Z E ^ 2025-05-07T20:32:51.4195800Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4195845Z 2025-05-07T20:32:51.4196256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4196262Z 2025-05-07T20:32:51.4196363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4196582Z self=, 2025-05-07T20:32:51.4196660Z T=4096, 2025-05-07T20:32:51.4196739Z D=7168, 2025-05-07T20:32:51.4196823Z scale_ub=None, 2025-05-07T20:32:51.4196908Z contiguous=False, 2025-05-07T20:32:51.4196991Z compiled=True, 2025-05-07T20:32:51.4197067Z ) 2025-05-07T20:32:51.4197284Z self = 2025-05-07T20:32:51.4197458Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4197463Z 2025-05-07T20:32:51.4197539Z @given( 2025-05-07T20:32:51.4197655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4197751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4197870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4197984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4198104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4198177Z ) 2025-05-07T20:32:51.4198419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4198511Z def test_silu_mul_quant( 2025-05-07T20:32:51.4198585Z self, 2025-05-07T20:32:51.4198664Z T: int, 2025-05-07T20:32:51.4198747Z D: int, 2025-05-07T20:32:51.4198843Z scale_ub: Optional[float], 2025-05-07T20:32:51.4198930Z contiguous: bool, 2025-05-07T20:32:51.4199098Z compiled: bool, 2025-05-07T20:32:51.4199179Z ) -> None: 2025-05-07T20:32:51.4199272Z torch.manual_seed(2025) 2025-05-07T20:32:51.4199348Z 2025-05-07T20:32:51.4199516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4199594Z 2025-05-07T20:32:51.4199685Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4199806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4199895Z x = x_sign * x_clamp 2025-05-07T20:32:51.4199973Z x0 = x[:, :D] 2025-05-07T20:32:51.4200052Z x1 = x[:, D:] 2025-05-07T20:32:51.4200126Z 2025-05-07T20:32:51.4200207Z if contiguous: 2025-05-07T20:32:51.4200296Z x0 = x0.contiguous() 2025-05-07T20:32:51.4200387Z x1 = x1.contiguous() 2025-05-07T20:32:51.4200456Z 2025-05-07T20:32:51.4200543Z if scale_ub is not None: 2025-05-07T20:32:51.4200648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4200785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4200859Z ) 2025-05-07T20:32:51.4200936Z else: 2025-05-07T20:32:51.4201027Z scale_ub_tensor = None 2025-05-07T20:32:51.4201099Z 2025-05-07T20:32:51.4201225Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4201363Z op = silu_mul_quant 2025-05-07T20:32:51.4201450Z if compiled: 2025-05-07T20:32:51.4201547Z op = torch.compile(op) 2025-05-07T20:32:51.4201651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4201723Z 2025-05-07T20:32:51.4201811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4201815Z 2025-05-07T20:32:51.4201909Z moe/activation_test.py:117: 2025-05-07T20:32:51.4202037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4202136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4202238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4202605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4202697Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4203195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4203337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4203688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4203911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4204296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4204394Z kernel = self.compile( 2025-05-07T20:32:51.4204770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4204950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4205081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4205085Z 2025-05-07T20:32:51.4205288Z self = 2025-05-07T20:32:51.4206108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4206618Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c00e0>} 2025-05-07T20:32:51.4207439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4207635Z context = 2025-05-07T20:32:51.4207639Z 2025-05-07T20:32:51.4207800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4208063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4208172Z module_map=module_map) 2025-05-07T20:32:51.4208611Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4208737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4208813Z E ^ 2025-05-07T20:32:51.4209167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4209172Z 2025-05-07T20:32:51.4209582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4209587Z 2025-05-07T20:32:51.4209693Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4209918Z self=, 2025-05-07T20:32:51.4209993Z T=16384, 2025-05-07T20:32:51.4210067Z D=5120, 2025-05-07T20:32:51.4210153Z scale_ub=1200.0, 2025-05-07T20:32:51.4210244Z contiguous=False, 2025-05-07T20:32:51.4210419Z compiled=False, 2025-05-07T20:32:51.4210492Z ) 2025-05-07T20:32:51.4210714Z self = 2025-05-07T20:32:51.4210894Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4210899Z 2025-05-07T20:32:51.4210974Z @given( 2025-05-07T20:32:51.4211089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4211189Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4211302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4211426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4211539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4211611Z ) 2025-05-07T20:32:51.4211856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4211946Z def test_silu_mul_quant( 2025-05-07T20:32:51.4212095Z self, 2025-05-07T20:32:51.4212177Z T: int, 2025-05-07T20:32:51.4212255Z D: int, 2025-05-07T20:32:51.4212350Z scale_ub: Optional[float], 2025-05-07T20:32:51.4212443Z contiguous: bool, 2025-05-07T20:32:51.4212525Z compiled: bool, 2025-05-07T20:32:51.4212602Z ) -> None: 2025-05-07T20:32:51.4212696Z torch.manual_seed(2025) 2025-05-07T20:32:51.4212766Z 2025-05-07T20:32:51.4212932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4213007Z 2025-05-07T20:32:51.4213095Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4213223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4213314Z x = x_sign * x_clamp 2025-05-07T20:32:51.4213394Z x0 = x[:, :D] 2025-05-07T20:32:51.4213477Z x1 = x[:, D:] 2025-05-07T20:32:51.4213550Z 2025-05-07T20:32:51.4213634Z if contiguous: 2025-05-07T20:32:51.4213728Z x0 = x0.contiguous() 2025-05-07T20:32:51.4213818Z x1 = x1.contiguous() 2025-05-07T20:32:51.4213891Z 2025-05-07T20:32:51.4213982Z if scale_ub is not None: 2025-05-07T20:32:51.4214086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4214219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4214295Z ) 2025-05-07T20:32:51.4214369Z else: 2025-05-07T20:32:51.4214464Z scale_ub_tensor = None 2025-05-07T20:32:51.4214541Z 2025-05-07T20:32:51.4214668Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4214757Z op = silu_mul_quant 2025-05-07T20:32:51.4214841Z if compiled: 2025-05-07T20:32:51.4215089Z op = torch.compile(op) 2025-05-07T20:32:51.4215197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4215268Z 2025-05-07T20:32:51.4215358Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4215362Z 2025-05-07T20:32:51.4215458Z moe/activation_test.py:117: 2025-05-07T20:32:51.4215591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4215690Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4215788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4216279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.4216377Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4216729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4216953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4217292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4217383Z kernel = self.compile( 2025-05-07T20:32:51.4217765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4217984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4218109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4218114Z 2025-05-07T20:32:51.4218318Z self = 2025-05-07T20:32:51.4219088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4219597Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c0b80>} 2025-05-07T20:32:51.4220338Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4220570Z context = 2025-05-07T20:32:51.4220575Z 2025-05-07T20:32:51.4220745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4221004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4221112Z module_map=module_map) 2025-05-07T20:32:51.4221272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4221367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4221445Z E ^ 2025-05-07T20:32:51.4221797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4221802Z 2025-05-07T20:32:51.4222211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4222225Z 2025-05-07T20:32:51.4222341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4222593Z self=, 2025-05-07T20:32:51.4222673Z T=16384, 2025-05-07T20:32:51.4222748Z D=5120, 2025-05-07T20:32:51.4222829Z scale_ub=1200.0, 2025-05-07T20:32:51.4222920Z contiguous=True, 2025-05-07T20:32:51.4223000Z compiled=True, 2025-05-07T20:32:51.4223073Z ) 2025-05-07T20:32:51.4223297Z self = 2025-05-07T20:32:51.4223469Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4223473Z 2025-05-07T20:32:51.4223636Z @given( 2025-05-07T20:32:51.4223758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4223857Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4223969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4224089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4224203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4224277Z ) 2025-05-07T20:32:51.4224523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4224613Z def test_silu_mul_quant( 2025-05-07T20:32:51.4224692Z self, 2025-05-07T20:32:51.4224768Z T: int, 2025-05-07T20:32:51.4224843Z D: int, 2025-05-07T20:32:51.4224940Z scale_ub: Optional[float], 2025-05-07T20:32:51.4225027Z contiguous: bool, 2025-05-07T20:32:51.4225109Z compiled: bool, 2025-05-07T20:32:51.4225191Z ) -> None: 2025-05-07T20:32:51.4225288Z torch.manual_seed(2025) 2025-05-07T20:32:51.4225361Z 2025-05-07T20:32:51.4225529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4225599Z 2025-05-07T20:32:51.4225688Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4225817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4225947Z x = x_sign * x_clamp 2025-05-07T20:32:51.4226034Z x0 = x[:, :D] 2025-05-07T20:32:51.4226115Z x1 = x[:, D:] 2025-05-07T20:32:51.4226187Z 2025-05-07T20:32:51.4226275Z if contiguous: 2025-05-07T20:32:51.4226363Z x0 = x0.contiguous() 2025-05-07T20:32:51.4226452Z x1 = x1.contiguous() 2025-05-07T20:32:51.4226528Z 2025-05-07T20:32:51.4226617Z if scale_ub is not None: 2025-05-07T20:32:51.4226719Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4226862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4226937Z ) 2025-05-07T20:32:51.4227017Z else: 2025-05-07T20:32:51.4227112Z scale_ub_tensor = None 2025-05-07T20:32:51.4227184Z 2025-05-07T20:32:51.4227309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4227401Z op = silu_mul_quant 2025-05-07T20:32:51.4227531Z if compiled: 2025-05-07T20:32:51.4227636Z op = torch.compile(op) 2025-05-07T20:32:51.4227741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4227811Z 2025-05-07T20:32:51.4227904Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4227908Z 2025-05-07T20:32:51.4228007Z moe/activation_test.py:117: 2025-05-07T20:32:51.4228133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4228234Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4228332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4228700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4228793Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4229281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4229379Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4229734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4229953Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4230292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4230382Z kernel = self.compile( 2025-05-07T20:32:51.4230764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4230936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4231141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4231146Z 2025-05-07T20:32:51.4231354Z self = 2025-05-07T20:32:51.4232125Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4232631Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c22a0>} 2025-05-07T20:32:51.4233371Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4233564Z context = 2025-05-07T20:32:51.4233571Z 2025-05-07T20:32:51.4233731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4233989Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4234102Z module_map=module_map) 2025-05-07T20:32:51.4234303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4234398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4234475Z E ^ 2025-05-07T20:32:51.4234822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4234827Z 2025-05-07T20:32:51.4235240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4235244Z 2025-05-07T20:32:51.4235344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4235569Z self=, 2025-05-07T20:32:51.4235647Z T=16384, 2025-05-07T20:32:51.4235722Z D=5120, 2025-05-07T20:32:51.4235803Z scale_ub=None, 2025-05-07T20:32:51.4235890Z contiguous=False, 2025-05-07T20:32:51.4235971Z compiled=True, 2025-05-07T20:32:51.4236086Z ) 2025-05-07T20:32:51.4236305Z self = 2025-05-07T20:32:51.4236478Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4236483Z 2025-05-07T20:32:51.4236559Z @given( 2025-05-07T20:32:51.4236673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4236770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4236884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4236998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4237111Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4237186Z ) 2025-05-07T20:32:51.4237434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4237527Z def test_silu_mul_quant( 2025-05-07T20:32:51.4237605Z self, 2025-05-07T20:32:51.4237680Z T: int, 2025-05-07T20:32:51.4237758Z D: int, 2025-05-07T20:32:51.4237857Z scale_ub: Optional[float], 2025-05-07T20:32:51.4237947Z contiguous: bool, 2025-05-07T20:32:51.4238036Z compiled: bool, 2025-05-07T20:32:51.4238111Z ) -> None: 2025-05-07T20:32:51.4238202Z torch.manual_seed(2025) 2025-05-07T20:32:51.4238276Z 2025-05-07T20:32:51.4238442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4238513Z 2025-05-07T20:32:51.4238608Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4238728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4238814Z x = x_sign * x_clamp 2025-05-07T20:32:51.4238898Z x0 = x[:, :D] 2025-05-07T20:32:51.4239057Z x1 = x[:, D:] 2025-05-07T20:32:51.4239135Z 2025-05-07T20:32:51.4239217Z if contiguous: 2025-05-07T20:32:51.4239305Z x0 = x0.contiguous() 2025-05-07T20:32:51.4239395Z x1 = x1.contiguous() 2025-05-07T20:32:51.4239464Z 2025-05-07T20:32:51.4239555Z if scale_ub is not None: 2025-05-07T20:32:51.4239665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4239796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4239868Z ) 2025-05-07T20:32:51.4239944Z else: 2025-05-07T20:32:51.4240037Z scale_ub_tensor = None 2025-05-07T20:32:51.4240109Z 2025-05-07T20:32:51.4240239Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4240324Z op = silu_mul_quant 2025-05-07T20:32:51.4240407Z if compiled: 2025-05-07T20:32:51.4240507Z op = torch.compile(op) 2025-05-07T20:32:51.4240613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4240691Z 2025-05-07T20:32:51.4240781Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4240785Z 2025-05-07T20:32:51.4240878Z moe/activation_test.py:117: 2025-05-07T20:32:51.4241007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4241110Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4241254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4241618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4241709Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4242197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4242291Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4242642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4242869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4243202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4243292Z kernel = self.compile( 2025-05-07T20:32:51.4243711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4243883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4244011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4244016Z 2025-05-07T20:32:51.4244216Z self = 2025-05-07T20:32:51.4245075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4245577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fa07e2c3060>} 2025-05-07T20:32:51.4246314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4246511Z context = 2025-05-07T20:32:51.4246516Z 2025-05-07T20:32:51.4246675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4246935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4247040Z module_map=module_map) 2025-05-07T20:32:51.4247197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4247399Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4247476Z E ^ 2025-05-07T20:32:51.4247825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4247830Z 2025-05-07T20:32:51.4248240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4248251Z 2025-05-07T20:32:51.4248350Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4248572Z self=, 2025-05-07T20:32:51.4248647Z T=2048, 2025-05-07T20:32:51.4248720Z D=5120, 2025-05-07T20:32:51.4248806Z scale_ub=None, 2025-05-07T20:32:51.4248890Z contiguous=False, 2025-05-07T20:32:51.4248972Z compiled=True, 2025-05-07T20:32:51.4249046Z ) 2025-05-07T20:32:51.4249259Z self = 2025-05-07T20:32:51.4249437Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4249446Z 2025-05-07T20:32:51.4249520Z @given( 2025-05-07T20:32:51.4249634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4249734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4249847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4250005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4250118Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4250195Z ) 2025-05-07T20:32:51.4250437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4250531Z def test_silu_mul_quant( 2025-05-07T20:32:51.4250607Z self, 2025-05-07T20:32:51.4250682Z T: int, 2025-05-07T20:32:51.4250759Z D: int, 2025-05-07T20:32:51.4250856Z scale_ub: Optional[float], 2025-05-07T20:32:51.4250945Z contiguous: bool, 2025-05-07T20:32:51.4251027Z compiled: bool, 2025-05-07T20:32:51.4251109Z ) -> None: 2025-05-07T20:32:51.4251202Z torch.manual_seed(2025) 2025-05-07T20:32:51.4251274Z 2025-05-07T20:32:51.4251438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4251511Z 2025-05-07T20:32:51.4251647Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4251773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4251861Z x = x_sign * x_clamp 2025-05-07T20:32:51.4251938Z x0 = x[:, :D] 2025-05-07T20:32:51.4252016Z x1 = x[:, D:] 2025-05-07T20:32:51.4252089Z 2025-05-07T20:32:51.4252171Z if contiguous: 2025-05-07T20:32:51.4252263Z x0 = x0.contiguous() 2025-05-07T20:32:51.4252350Z x1 = x1.contiguous() 2025-05-07T20:32:51.4252426Z 2025-05-07T20:32:51.4252518Z if scale_ub is not None: 2025-05-07T20:32:51.4252621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4252757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4252835Z ) 2025-05-07T20:32:51.4252908Z else: 2025-05-07T20:32:51.4253001Z scale_ub_tensor = None 2025-05-07T20:32:51.4253073Z 2025-05-07T20:32:51.4253199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4253288Z op = silu_mul_quant 2025-05-07T20:32:51.4253373Z if compiled: 2025-05-07T20:32:51.4253467Z op = torch.compile(op) 2025-05-07T20:32:51.4253568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4253639Z 2025-05-07T20:32:51.4253722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4253726Z 2025-05-07T20:32:51.4253820Z moe/activation_test.py:117: 2025-05-07T20:32:51.4253942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4254039Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4254137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4254579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4254668Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4255154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4255250Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4255600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4255817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4256147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4256236Z kernel = self.compile( 2025-05-07T20:32:51.4256610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4256786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4256909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4256913Z 2025-05-07T20:32:51.4257115Z self = 2025-05-07T20:32:51.4257886Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4258425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd07c0>} 2025-05-07T20:32:51.4259161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4259352Z context = 2025-05-07T20:32:51.4259356Z 2025-05-07T20:32:51.4259515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4259774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4259918Z module_map=module_map) 2025-05-07T20:32:51.4260078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4260170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4260240Z E ^ 2025-05-07T20:32:51.4260590Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4260595Z 2025-05-07T20:32:51.4261001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4261005Z 2025-05-07T20:32:51.4261113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4261328Z self=, 2025-05-07T20:32:51.4261398Z T=2048, 2025-05-07T20:32:51.4261469Z D=5120, 2025-05-07T20:32:51.4261547Z scale_ub=1200.0, 2025-05-07T20:32:51.4261627Z contiguous=False, 2025-05-07T20:32:51.4261709Z compiled=True, 2025-05-07T20:32:51.4261780Z ) 2025-05-07T20:32:51.4261991Z self = 2025-05-07T20:32:51.4262161Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4262165Z 2025-05-07T20:32:51.4262235Z @given( 2025-05-07T20:32:51.4262352Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4262444Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4262552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4262665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4262861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4262930Z ) 2025-05-07T20:32:51.4263171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4263256Z def test_silu_mul_quant( 2025-05-07T20:32:51.4263326Z self, 2025-05-07T20:32:51.4263403Z T: int, 2025-05-07T20:32:51.4263476Z D: int, 2025-05-07T20:32:51.4263567Z scale_ub: Optional[float], 2025-05-07T20:32:51.4263653Z contiguous: bool, 2025-05-07T20:32:51.4263734Z compiled: bool, 2025-05-07T20:32:51.4263807Z ) -> None: 2025-05-07T20:32:51.4263895Z torch.manual_seed(2025) 2025-05-07T20:32:51.4263965Z 2025-05-07T20:32:51.4264129Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4264197Z 2025-05-07T20:32:51.4264281Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4264402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4264484Z x = x_sign * x_clamp 2025-05-07T20:32:51.4264565Z x0 = x[:, :D] 2025-05-07T20:32:51.4264641Z x1 = x[:, D:] 2025-05-07T20:32:51.4264708Z 2025-05-07T20:32:51.4264786Z if contiguous: 2025-05-07T20:32:51.4264875Z x0 = x0.contiguous() 2025-05-07T20:32:51.4264958Z x1 = x1.contiguous() 2025-05-07T20:32:51.4265027Z 2025-05-07T20:32:51.4265160Z if scale_ub is not None: 2025-05-07T20:32:51.4265261Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4265394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4265465Z ) 2025-05-07T20:32:51.4265539Z else: 2025-05-07T20:32:51.4265631Z scale_ub_tensor = None 2025-05-07T20:32:51.4265698Z 2025-05-07T20:32:51.4265823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4265909Z op = silu_mul_quant 2025-05-07T20:32:51.4265989Z if compiled: 2025-05-07T20:32:51.4266084Z op = torch.compile(op) 2025-05-07T20:32:51.4266200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4266270Z 2025-05-07T20:32:51.4266357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4266365Z 2025-05-07T20:32:51.4266456Z moe/activation_test.py:117: 2025-05-07T20:32:51.4266580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4266725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4266820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4267177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4267267Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4267749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4267843Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4268194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4268408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4268742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4268836Z kernel = self.compile( 2025-05-07T20:32:51.4269213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4269388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4269512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4269516Z 2025-05-07T20:32:51.4269717Z self = 2025-05-07T20:32:51.4270559Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4271056Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd1580>} 2025-05-07T20:32:51.4271800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4271987Z context = 2025-05-07T20:32:51.4271992Z 2025-05-07T20:32:51.4272156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4272413Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4272515Z module_map=module_map) 2025-05-07T20:32:51.4272680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4272776Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4272848Z E ^ 2025-05-07T20:32:51.4273194Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4273201Z 2025-05-07T20:32:51.4273648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4273653Z 2025-05-07T20:32:51.4273753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4273968Z self=, 2025-05-07T20:32:51.4274043Z T=4096, 2025-05-07T20:32:51.4274113Z D=5120, 2025-05-07T20:32:51.4274193Z scale_ub=1200.0, 2025-05-07T20:32:51.4274276Z contiguous=True, 2025-05-07T20:32:51.4274351Z compiled=True, 2025-05-07T20:32:51.4274418Z ) 2025-05-07T20:32:51.4274640Z self = 2025-05-07T20:32:51.4274807Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4274811Z 2025-05-07T20:32:51.4274880Z @given( 2025-05-07T20:32:51.4274997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4275156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4275273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4275384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4275491Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4275563Z ) 2025-05-07T20:32:51.4275801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4275888Z def test_silu_mul_quant( 2025-05-07T20:32:51.4275965Z self, 2025-05-07T20:32:51.4276042Z T: int, 2025-05-07T20:32:51.4276113Z D: int, 2025-05-07T20:32:51.4276209Z scale_ub: Optional[float], 2025-05-07T20:32:51.4276296Z contiguous: bool, 2025-05-07T20:32:51.4276377Z compiled: bool, 2025-05-07T20:32:51.4276453Z ) -> None: 2025-05-07T20:32:51.4276540Z torch.manual_seed(2025) 2025-05-07T20:32:51.4276610Z 2025-05-07T20:32:51.4276771Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4276841Z 2025-05-07T20:32:51.4276931Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4277050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4277132Z x = x_sign * x_clamp 2025-05-07T20:32:51.4277210Z x0 = x[:, :D] 2025-05-07T20:32:51.4277284Z x1 = x[:, D:] 2025-05-07T20:32:51.4277349Z 2025-05-07T20:32:51.4277431Z if contiguous: 2025-05-07T20:32:51.4277516Z x0 = x0.contiguous() 2025-05-07T20:32:51.4277598Z x1 = x1.contiguous() 2025-05-07T20:32:51.4277673Z 2025-05-07T20:32:51.4277758Z if scale_ub is not None: 2025-05-07T20:32:51.4277862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4278075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4278146Z ) 2025-05-07T20:32:51.4278219Z else: 2025-05-07T20:32:51.4278306Z scale_ub_tensor = None 2025-05-07T20:32:51.4278371Z 2025-05-07T20:32:51.4278502Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4278590Z op = silu_mul_quant 2025-05-07T20:32:51.4278668Z if compiled: 2025-05-07T20:32:51.4278764Z op = torch.compile(op) 2025-05-07T20:32:51.4278863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4278930Z 2025-05-07T20:32:51.4281829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4281836Z 2025-05-07T20:32:51.4281944Z moe/activation_test.py:117: 2025-05-07T20:32:51.4282081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4282184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4282292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4282663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4282756Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4283248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4283416Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4283769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4283991Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4284447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4284539Z kernel = self.compile( 2025-05-07T20:32:51.4284926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4285099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4285227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4285231Z 2025-05-07T20:32:51.4285445Z self = 2025-05-07T20:32:51.4286317Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4286822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f93cd2840>} 2025-05-07T20:32:51.4287566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4287758Z context = 2025-05-07T20:32:51.4287763Z 2025-05-07T20:32:51.4287924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4288187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4288300Z module_map=module_map) 2025-05-07T20:32:51.4288458Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4288554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4288633Z E ^ 2025-05-07T20:32:51.4288984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4288989Z 2025-05-07T20:32:51.4289401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4289530Z 2025-05-07T20:32:51.4289633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4289854Z self=, 2025-05-07T20:32:51.4289932Z T=128, 2025-05-07T20:32:51.4290006Z D=5120, 2025-05-07T20:32:51.4290092Z scale_ub=1200.0, 2025-05-07T20:32:51.4290189Z contiguous=False, 2025-05-07T20:32:51.4290268Z compiled=True, 2025-05-07T20:32:51.4290344Z ) 2025-05-07T20:32:51.4290560Z self = 2025-05-07T20:32:51.4290727Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.4290732Z 2025-05-07T20:32:51.4290813Z @given( 2025-05-07T20:32:51.4290929Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4291026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4291151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4291271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4291382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4291461Z ) 2025-05-07T20:32:51.4291705Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4291798Z def test_silu_mul_quant( 2025-05-07T20:32:51.4291879Z self, 2025-05-07T20:32:51.4291999Z T: int, 2025-05-07T20:32:51.4292078Z D: int, 2025-05-07T20:32:51.4292174Z scale_ub: Optional[float], 2025-05-07T20:32:51.4292261Z contiguous: bool, 2025-05-07T20:32:51.4292346Z compiled: bool, 2025-05-07T20:32:51.4292426Z ) -> None: 2025-05-07T20:32:51.4292517Z torch.manual_seed(2025) 2025-05-07T20:32:51.4292591Z 2025-05-07T20:32:51.4292760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4292834Z 2025-05-07T20:32:51.4292926Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4293048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4293142Z x = x_sign * x_clamp 2025-05-07T20:32:51.4293221Z x0 = x[:, :D] 2025-05-07T20:32:51.4293299Z x1 = x[:, D:] 2025-05-07T20:32:51.4293380Z 2025-05-07T20:32:51.4293462Z if contiguous: 2025-05-07T20:32:51.4293552Z x0 = x0.contiguous() 2025-05-07T20:32:51.4293695Z x1 = x1.contiguous() 2025-05-07T20:32:51.4293765Z 2025-05-07T20:32:51.4293855Z if scale_ub is not None: 2025-05-07T20:32:51.4293963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4294096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4294169Z ) 2025-05-07T20:32:51.4294246Z else: 2025-05-07T20:32:51.4294338Z scale_ub_tensor = None 2025-05-07T20:32:51.4294409Z 2025-05-07T20:32:51.4294538Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4294625Z op = silu_mul_quant 2025-05-07T20:32:51.4294713Z if compiled: 2025-05-07T20:32:51.4294815Z op = torch.compile(op) 2025-05-07T20:32:51.4294921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4294997Z 2025-05-07T20:32:51.4295085Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4295089Z 2025-05-07T20:32:51.4295189Z moe/activation_test.py:117: 2025-05-07T20:32:51.4295321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4295422Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4295518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4295884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4295975Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4296470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:51.4302604Z Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [identical Triton compile stack as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
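Every CompilationError in this run has the same root cause: the Triton kernel requests the fp8e4nv (FP8 E4M3) dtype, which this runner's GPU cannot compile; only 'fp8e4b15' and 'fp8e5' are available on this architecture. A minimal guard sketch follows, assuming the intent is to skip rather than fail on such hardware; the (8, 9) capability threshold and the class name are assumptions (Triton's fp8e4nv generally needs SM 8.9+, while g5 instances carry an A10G at SM 8.6, consistent with the 22.07 GiB capacity reported later in this log):

# Hypothetical guard sketch: skip FP8 E4M3 ("fp8e4nv") tests on GPUs that
# cannot compile them. The (8, 9) threshold is an assumption inferred from
# the error above, not a value taken from FBGEMM's test suite.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv is typically available from compute capability 8.9 (Ada/Hopper) up.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv (FP8 E4M3) unsupported on this GPU")
class SiluMulQuantGuardedTest(unittest.TestCase):  # hypothetical name
    ...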
[The next five examples fail with the identical CompilationError; their repeated test source listings and Triton tracebacks are omitted.]

2025-05-07T20:32:51.4316106Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.4328668Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:51.4344124Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:51.4357305Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:51.4370567Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
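For context on what the failing kernel computes: judging from the test body above, silu_mul_quant takes the two halves x0 and x1 of the input, applies SiLU to one, multiplies them, and quantizes the product to FP8 with a returned scale that scale_ub optionally bounds. A hedged eager-mode sketch of that contract follows; the rowwise-scale convention and the helper name are assumptions for illustration, not FBGEMM's implementation:

# Illustrative eager-mode reference (assumption: silu_mul_quant fuses
# y = SiLU(x0) * x1 followed by rowwise FP8 quantization, with scale_ub as
# an optional cap on the per-row amax). A sketch, not FBGEMM's kernel.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Compute in fp32 for a stable reference.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    scale = amax / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)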
2025-05-07T20:32:51.4383738Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

[The next four examples hit the same OutOfMemoryError while building the test inputs; only the distinguishing details are kept.]

2025-05-07T20:32:51.4389282Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free (21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4394806Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free (21.50 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4400144Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
2025-05-07T20:32:51.4405698Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch, 85.02 MiB reserved but unallocated)
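The allocator hint in these messages is actionable: with 22.07 GiB total and over 21 GiB already held by PyTorch, the reserved-but-unallocated blocks (45-141 MiB here) point at fragmentation. A sketch of the mitigation the error message itself suggests, assuming the variable can be set before CUDA initializes (in CI it would normally be exported in the workflow environment rather than in-process):

# Sketch of the mitigation suggested by the OutOfMemoryError text above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# so it has to precede the first use of torch.cuda.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after setting the env var so the allocator sees it

# Between Hypothesis examples, releasing cached blocks can also reclaim
# reserved-but-unallocated memory:
torch.cuda.empty_cache()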
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4414439Z 2025-05-07T20:32:51.4414557Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.4414561Z 2025-05-07T20:32:51.4414667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4414882Z self=, 2025-05-07T20:32:51.4414954Z T=1, 2025-05-07T20:32:51.4415030Z D=7168, 2025-05-07T20:32:51.4415108Z scale_ub=1200.0, 2025-05-07T20:32:51.4415189Z contiguous=True, 2025-05-07T20:32:51.4415273Z compiled=False, 2025-05-07T20:32:51.4415340Z ) 2025-05-07T20:32:51.4415551Z self = 2025-05-07T20:32:51.4415714Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4415725Z 2025-05-07T20:32:51.4415799Z @given( 2025-05-07T20:32:51.4415916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4416010Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4416123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4416236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4416344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4416412Z ) 2025-05-07T20:32:51.4416655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4416739Z def test_silu_mul_quant( 2025-05-07T20:32:51.4416809Z self, 2025-05-07T20:32:51.4416954Z T: int, 2025-05-07T20:32:51.4417030Z D: int, 2025-05-07T20:32:51.4417128Z scale_ub: Optional[float], 2025-05-07T20:32:51.4417209Z contiguous: bool, 2025-05-07T20:32:51.4417285Z compiled: bool, 2025-05-07T20:32:51.4417363Z ) -> None: 2025-05-07T20:32:51.4417451Z torch.manual_seed(2025) 2025-05-07T20:32:51.4417522Z 2025-05-07T20:32:51.4417688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4417759Z 2025-05-07T20:32:51.4417841Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4417962Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4418046Z x = x_sign * x_clamp 2025-05-07T20:32:51.4418117Z x0 = x[:, :D] 2025-05-07T20:32:51.4418196Z x1 = x[:, D:] 2025-05-07T20:32:51.4418263Z 2025-05-07T20:32:51.4418341Z if contiguous: 2025-05-07T20:32:51.4418433Z x0 = x0.contiguous() 2025-05-07T20:32:51.4418521Z x1 = x1.contiguous() 2025-05-07T20:32:51.4418591Z 2025-05-07T20:32:51.4418781Z if scale_ub is not None: 2025-05-07T20:32:51.4418882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4419018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4419095Z ) 2025-05-07T20:32:51.4419204Z else: 2025-05-07T20:32:51.4419297Z scale_ub_tensor = None 2025-05-07T20:32:51.4419365Z 2025-05-07T20:32:51.4419490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4419577Z op = silu_mul_quant 2025-05-07T20:32:51.4419657Z if compiled: 2025-05-07T20:32:51.4419751Z op = torch.compile(op) 2025-05-07T20:32:51.4419855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4419923Z 2025-05-07T20:32:51.4420012Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4420017Z 2025-05-07T20:32:51.4420107Z moe/activation_test.py:117: 2025-05-07T20:32:51.4420235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4420338Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4420435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4420930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4421072Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4421421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4421641Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4421974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4422063Z kernel = self.compile( 2025-05-07T20:32:51.4422451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4422623Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4422751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4422765Z 2025-05-07T20:32:51.4422964Z self = 2025-05-07T20:32:51.4423741Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4424244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e58b80>} 2025-05-07T20:32:51.4425021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4425216Z context = 2025-05-07T20:32:51.4425220Z 2025-05-07T20:32:51.4425382Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4425644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4425759Z module_map=module_map) 2025-05-07T20:32:51.4425916Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4426014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4426087Z E ^ 2025-05-07T20:32:51.4426434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4426439Z 2025-05-07T20:32:51.4426853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4426861Z 2025-05-07T20:32:51.4426963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4427224Z self=, 2025-05-07T20:32:51.4427304Z T=128, 2025-05-07T20:32:51.4427374Z D=5120, 2025-05-07T20:32:51.4427456Z scale_ub=None, 2025-05-07T20:32:51.4427575Z contiguous=True, 2025-05-07T20:32:51.4427650Z compiled=False, 2025-05-07T20:32:51.4427720Z ) 2025-05-07T20:32:51.4427934Z self = 2025-05-07T20:32:51.4428097Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4428101Z 2025-05-07T20:32:51.4428187Z @given( 2025-05-07T20:32:51.4428303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4428398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4428510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4428624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4428738Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4428831Z ) 2025-05-07T20:32:51.4429099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4429192Z def test_silu_mul_quant( 2025-05-07T20:32:51.4429311Z self, 2025-05-07T20:32:51.4429391Z T: int, 2025-05-07T20:32:51.4429468Z D: int, 2025-05-07T20:32:51.4429563Z scale_ub: Optional[float], 2025-05-07T20:32:51.4429651Z contiguous: bool, 2025-05-07T20:32:51.4429734Z compiled: bool, 2025-05-07T20:32:51.4429807Z ) -> None: 2025-05-07T20:32:51.4429901Z torch.manual_seed(2025) 2025-05-07T20:32:51.4429971Z 2025-05-07T20:32:51.4430133Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4430206Z 2025-05-07T20:32:51.4430292Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4430414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4430500Z x = x_sign * x_clamp 2025-05-07T20:32:51.4430575Z x0 = x[:, :D] 2025-05-07T20:32:51.4430661Z x1 = x[:, D:] 2025-05-07T20:32:51.4430729Z 2025-05-07T20:32:51.4430811Z if contiguous: 2025-05-07T20:32:51.4430901Z x0 = x0.contiguous() 2025-05-07T20:32:51.4430989Z x1 = x1.contiguous() 2025-05-07T20:32:51.4431058Z 2025-05-07T20:32:51.4431151Z if scale_ub is not None: 2025-05-07T20:32:51.4431258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4431387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4431464Z ) 2025-05-07T20:32:51.4431539Z else: 2025-05-07T20:32:51.4431637Z scale_ub_tensor = None 2025-05-07T20:32:51.4431706Z 2025-05-07T20:32:51.4431832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4431919Z op = silu_mul_quant 2025-05-07T20:32:51.4431998Z if compiled: 2025-05-07T20:32:51.4432136Z op = torch.compile(op) 2025-05-07T20:32:51.4432243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4432312Z 2025-05-07T20:32:51.4432401Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4432405Z 2025-05-07T20:32:51.4432502Z moe/activation_test.py:117: 2025-05-07T20:32:51.4432631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4432731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4432825Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4433314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4433411Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4433762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4433983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4434393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4434484Z kernel = self.compile( 2025-05-07T20:32:51.4434861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4435074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4435196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4435201Z 2025-05-07T20:32:51.4435401Z self = 2025-05-07T20:32:51.4436170Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4436675Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e59a80>} 2025-05-07T20:32:51.4437409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4437638Z context = 2025-05-07T20:32:51.4437649Z 2025-05-07T20:32:51.4437807Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4438062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4438172Z module_map=module_map) 2025-05-07T20:32:51.4438328Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4438422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4438501Z E ^ 2025-05-07T20:32:51.4438853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4438857Z 2025-05-07T20:32:51.4439264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4439274Z 2025-05-07T20:32:51.4439372Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4439591Z self=, 2025-05-07T20:32:51.4439669Z T=128, 2025-05-07T20:32:51.4439746Z D=7168, 2025-05-07T20:32:51.4439822Z scale_ub=None, 2025-05-07T20:32:51.4439906Z contiguous=True, 2025-05-07T20:32:51.4439983Z compiled=False, 2025-05-07T20:32:51.4440054Z ) 2025-05-07T20:32:51.4440267Z self = 2025-05-07T20:32:51.4440432Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4440436Z 2025-05-07T20:32:51.4440554Z @given( 2025-05-07T20:32:51.4440672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4440765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4440880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4440995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4441108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4441182Z ) 2025-05-07T20:32:51.4441420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4441507Z def test_silu_mul_quant( 2025-05-07T20:32:51.4441583Z self, 2025-05-07T20:32:51.4441657Z T: int, 2025-05-07T20:32:51.4441731Z D: int, 2025-05-07T20:32:51.4441825Z scale_ub: Optional[float], 2025-05-07T20:32:51.4441912Z contiguous: bool, 2025-05-07T20:32:51.4441995Z compiled: bool, 2025-05-07T20:32:51.4442069Z ) -> None: 2025-05-07T20:32:51.4442162Z torch.manual_seed(2025) 2025-05-07T20:32:51.4442247Z 2025-05-07T20:32:51.4442491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4442559Z 2025-05-07T20:32:51.4442649Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4442766Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4442894Z x = x_sign * x_clamp 2025-05-07T20:32:51.4442971Z x0 = x[:, :D] 2025-05-07T20:32:51.4443045Z x1 = x[:, D:] 2025-05-07T20:32:51.4443118Z 2025-05-07T20:32:51.4443206Z if contiguous: 2025-05-07T20:32:51.4443292Z x0 = x0.contiguous() 2025-05-07T20:32:51.4443378Z x1 = x1.contiguous() 2025-05-07T20:32:51.4443447Z 2025-05-07T20:32:51.4443537Z if scale_ub is not None: 2025-05-07T20:32:51.4443643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4443772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4443844Z ) 2025-05-07T20:32:51.4443923Z else: 2025-05-07T20:32:51.4444012Z scale_ub_tensor = None 2025-05-07T20:32:51.4444086Z 2025-05-07T20:32:51.4444212Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4444376Z op = silu_mul_quant 2025-05-07T20:32:51.4444501Z if compiled: 2025-05-07T20:32:51.4444601Z op = torch.compile(op) 2025-05-07T20:32:51.4444700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4444773Z 2025-05-07T20:32:51.4444860Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4444864Z 2025-05-07T20:32:51.4444957Z moe/activation_test.py:117: 2025-05-07T20:32:51.4445083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4445178Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4445271Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4445767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4445859Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4446214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4446430Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4446765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4446859Z kernel = self.compile( 2025-05-07T20:32:51.4447234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4447406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4447530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4447534Z 2025-05-07T20:32:51.4447779Z self = 2025-05-07T20:32:51.4448561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4449060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5a980>} 2025-05-07T20:32:51.4449799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4449984Z context = 2025-05-07T20:32:51.4449988Z 2025-05-07T20:32:51.4450147Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4450407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4450555Z module_map=module_map) 2025-05-07T20:32:51.4450716Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4450811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4450889Z E ^ 2025-05-07T20:32:51.4451281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4451286Z 2025-05-07T20:32:51.4451691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4451695Z 2025-05-07T20:32:51.4451796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4452017Z self=, 2025-05-07T20:32:51.4452091Z T=2048, 2025-05-07T20:32:51.4452165Z D=7168, 2025-05-07T20:32:51.4452243Z scale_ub=1200.0, 2025-05-07T20:32:51.4452325Z contiguous=True, 2025-05-07T20:32:51.4452405Z compiled=False, 2025-05-07T20:32:51.4452473Z ) 2025-05-07T20:32:51.4452690Z self = 2025-05-07T20:32:51.4452861Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4452908Z 2025-05-07T20:32:51.4452982Z @given( 2025-05-07T20:32:51.4453095Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4453197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4453307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4453422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4453530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4453602Z ) 2025-05-07T20:32:51.4453846Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4453934Z def test_silu_mul_quant( 2025-05-07T20:32:51.4454007Z self, 2025-05-07T20:32:51.4454088Z T: int, 2025-05-07T20:32:51.4454160Z D: int, 2025-05-07T20:32:51.4454255Z scale_ub: Optional[float], 2025-05-07T20:32:51.4454344Z contiguous: bool, 2025-05-07T20:32:51.4454424Z compiled: bool, 2025-05-07T20:32:51.4454500Z ) -> None: 2025-05-07T20:32:51.4454593Z torch.manual_seed(2025) 2025-05-07T20:32:51.4454660Z 2025-05-07T20:32:51.4454826Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4456637Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
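Two distinct failures alternate through this run. The CompilationError above is Triton rejecting the fp8e4nv (FP8 E4M3) dtype: the g5.4xlarge runner's A10G GPU is compute capability sm_86, where Triton only exposes the 'fp8e4b15' and 'fp8e5' encodings, so any kernel that materializes fp8e4nv fails at compile time. A minimal sketch of a capability gate such a test could use to skip cleanly on this hardware; the (8, 9) threshold is an assumption inferred from this error, not something stated by FBGEMM:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (E4M3) Triton kernels need a compute capability
        # newer than the sm_86 A10G in this log, which only has fp8e4b15/fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class GatedActivationTests(unittest.TestCase):
        ...

The OutOfMemoryError below is unrelated to dtype support and compounds as the run goes on.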
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4456645Z 2025-05-07T20:32:51.4456766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4456771Z 2025-05-07T20:32:51.4456867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4457085Z self=, 2025-05-07T20:32:51.4457166Z T=1, 2025-05-07T20:32:51.4457237Z D=5120, 2025-05-07T20:32:51.4457315Z scale_ub=1200.0, 2025-05-07T20:32:51.4457395Z contiguous=True, 2025-05-07T20:32:51.4457475Z compiled=False, 2025-05-07T20:32:51.4457543Z ) 2025-05-07T20:32:51.4457755Z self = 2025-05-07T20:32:51.4457914Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4457919Z 2025-05-07T20:32:51.4457997Z @given( 2025-05-07T20:32:51.4458112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4458208Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4458369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4458482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4458590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4458666Z ) 2025-05-07T20:32:51.4458943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4459032Z def test_silu_mul_quant( 2025-05-07T20:32:51.4459104Z self, 2025-05-07T20:32:51.4459178Z T: int, 2025-05-07T20:32:51.4459251Z D: int, 2025-05-07T20:32:51.4459345Z scale_ub: Optional[float], 2025-05-07T20:32:51.4459428Z contiguous: bool, 2025-05-07T20:32:51.4459513Z compiled: bool, 2025-05-07T20:32:51.4459586Z ) -> None: 2025-05-07T20:32:51.4459674Z torch.manual_seed(2025) 2025-05-07T20:32:51.4459744Z 2025-05-07T20:32:51.4459911Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4459979Z 2025-05-07T20:32:51.4460071Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4460190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4460276Z x = x_sign * x_clamp 2025-05-07T20:32:51.4460399Z x0 = x[:, :D] 2025-05-07T20:32:51.4460477Z x1 = x[:, D:] 2025-05-07T20:32:51.4460548Z 2025-05-07T20:32:51.4460629Z if contiguous: 2025-05-07T20:32:51.4460720Z x0 = x0.contiguous() 2025-05-07T20:32:51.4460808Z x1 = x1.contiguous() 2025-05-07T20:32:51.4460874Z 2025-05-07T20:32:51.4460962Z if scale_ub is not None: 2025-05-07T20:32:51.4461066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4461194Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4461266Z ) 2025-05-07T20:32:51.4461342Z else: 2025-05-07T20:32:51.4461431Z scale_ub_tensor = None 2025-05-07T20:32:51.4461497Z 2025-05-07T20:32:51.4461626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4461714Z op = silu_mul_quant 2025-05-07T20:32:51.4461798Z if compiled: 2025-05-07T20:32:51.4461893Z op = torch.compile(op) 2025-05-07T20:32:51.4461997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4462076Z 2025-05-07T20:32:51.4462161Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4462166Z 2025-05-07T20:32:51.4462258Z moe/activation_test.py:117: 2025-05-07T20:32:51.4462390Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4462491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4462590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4463086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4463181Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4463610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4463833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4464168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4464274Z kernel = self.compile( 2025-05-07T20:32:51.4464656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4464829Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4464960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4464964Z 2025-05-07T20:32:51.4465165Z self = 2025-05-07T20:32:51.4466031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4466536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92e5be20>} 2025-05-07T20:32:51.4467321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4467510Z context = 2025-05-07T20:32:51.4467514Z 2025-05-07T20:32:51.4467676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4467938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4468048Z module_map=module_map) 2025-05-07T20:32:51.4468212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4468313Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4468388Z E ^ 2025-05-07T20:32:51.4468743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4468788Z 2025-05-07T20:32:51.4469205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4469209Z 2025-05-07T20:32:51.4469314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4469541Z self=, 2025-05-07T20:32:51.4469618Z T=2048, 2025-05-07T20:32:51.4469695Z D=5120, 2025-05-07T20:32:51.4469774Z scale_ub=None, 2025-05-07T20:32:51.4469857Z contiguous=True, 2025-05-07T20:32:51.4469939Z compiled=False, 2025-05-07T20:32:51.4470015Z ) 2025-05-07T20:32:51.4470236Z self = 2025-05-07T20:32:51.4470408Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4470416Z 2025-05-07T20:32:51.4470490Z @given( 2025-05-07T20:32:51.4470606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4470710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4470823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4470936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4471052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4471126Z ) 2025-05-07T20:32:51.4471367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4471461Z def test_silu_mul_quant( 2025-05-07T20:32:51.4471537Z self, 2025-05-07T20:32:51.4471612Z T: int, 2025-05-07T20:32:51.4471696Z D: int, 2025-05-07T20:32:51.4471839Z scale_ub: Optional[float], 2025-05-07T20:32:51.4471932Z contiguous: bool, 2025-05-07T20:32:51.4472019Z compiled: bool, 2025-05-07T20:32:51.4472099Z ) -> None: 2025-05-07T20:32:51.4472195Z torch.manual_seed(2025) 2025-05-07T20:32:51.4472271Z 2025-05-07T20:32:51.4472440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4472522Z 2025-05-07T20:32:51.4472613Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.4474394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4474399Z 2025-05-07T20:32:51.4474574Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.4474579Z 2025-05-07T20:32:51.4474682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4474903Z self=, 2025-05-07T20:32:51.4475020Z T=16384, 2025-05-07T20:32:51.4475099Z D=5120, 2025-05-07T20:32:51.4475179Z scale_ub=None, 2025-05-07T20:32:51.4475262Z contiguous=True, 2025-05-07T20:32:51.4475347Z compiled=False, 2025-05-07T20:32:51.4475419Z ) 2025-05-07T20:32:51.4475635Z self = 2025-05-07T20:32:51.4475832Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4475838Z 2025-05-07T20:32:51.4475922Z @given( 2025-05-07T20:32:51.4476048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4476173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4476308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4476431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4476543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4476660Z ) 2025-05-07T20:32:51.4476907Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4477001Z def test_silu_mul_quant( 2025-05-07T20:32:51.4477077Z self, 2025-05-07T20:32:51.4477155Z T: int, 2025-05-07T20:32:51.4477232Z D: int, 2025-05-07T20:32:51.4477332Z scale_ub: Optional[float], 2025-05-07T20:32:51.4477425Z contiguous: bool, 2025-05-07T20:32:51.4477512Z compiled: bool, 2025-05-07T20:32:51.4477589Z ) -> None: 2025-05-07T20:32:51.4477687Z torch.manual_seed(2025) 2025-05-07T20:32:51.4477759Z 2025-05-07T20:32:51.4477930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4479715Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4479728Z 2025-05-07T20:32:51.4479849Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4479854Z 2025-05-07T20:32:51.4479954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4480174Z self=, 2025-05-07T20:32:51.4480251Z T=4096, 2025-05-07T20:32:51.4480327Z D=5120, 2025-05-07T20:32:51.4480450Z scale_ub=None, 2025-05-07T20:32:51.4480536Z contiguous=True, 2025-05-07T20:32:51.4480620Z compiled=False, 2025-05-07T20:32:51.4480692Z ) 2025-05-07T20:32:51.4480912Z self = 2025-05-07T20:32:51.4481083Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4481092Z 2025-05-07T20:32:51.4481170Z @given( 2025-05-07T20:32:51.4481289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4481386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4481504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4481617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4481728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4481804Z ) 2025-05-07T20:32:51.4482046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4482138Z def test_silu_mul_quant( 2025-05-07T20:32:51.4482223Z self, 2025-05-07T20:32:51.4482301Z T: int, 2025-05-07T20:32:51.4482420Z D: int, 2025-05-07T20:32:51.4482519Z scale_ub: Optional[float], 2025-05-07T20:32:51.4482608Z contiguous: bool, 2025-05-07T20:32:51.4482697Z compiled: bool, 2025-05-07T20:32:51.4482776Z ) -> None: 2025-05-07T20:32:51.4482913Z torch.manual_seed(2025) 2025-05-07T20:32:51.4482989Z 2025-05-07T20:32:51.4483155Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4485020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
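The allocator's request sizes line up exactly with the test's first allocation: x has shape [T, 2 * D] in bfloat16, i.e. T * 2 * D * 2 bytes. A quick check against the sizes reported so far (plain arithmetic, not part of the test suite):

    # Each failing randn call asks for T * 2*D bfloat16 elements (2 bytes each).
    for T, D in [(2048, 7168), (2048, 5120), (16384, 5120), (4096, 5120)]:
        mib = T * 2 * D * 2 / 1024**2
        print(f"T={T:6d} D={D}: {mib:7.2f} MiB")
    # -> 56.00, 40.00, 320.00, 80.00 MiB, matching the allocations in the log.

The requests themselves are modest; the problem is that roughly 21.7 GiB is already held on the 22.07 GiB device when they arrive.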
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4485026Z 2025-05-07T20:32:51.4485143Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4485148Z 2025-05-07T20:32:51.4485290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4485517Z self=, 2025-05-07T20:32:51.4485597Z T=2048, 2025-05-07T20:32:51.4485674Z D=5120, 2025-05-07T20:32:51.4485752Z scale_ub=None, 2025-05-07T20:32:51.4485837Z contiguous=False, 2025-05-07T20:32:51.4485925Z compiled=False, 2025-05-07T20:32:51.4485997Z ) 2025-05-07T20:32:51.4486215Z self = 2025-05-07T20:32:51.4486392Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.4486397Z 2025-05-07T20:32:51.4486474Z @given( 2025-05-07T20:32:51.4486591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4486692Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4486808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4486927Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4487041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4487117Z ) 2025-05-07T20:32:51.4487361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4487453Z def test_silu_mul_quant( 2025-05-07T20:32:51.4487529Z self, 2025-05-07T20:32:51.4487608Z T: int, 2025-05-07T20:32:51.4487683Z D: int, 2025-05-07T20:32:51.4487780Z scale_ub: Optional[float], 2025-05-07T20:32:51.4487873Z contiguous: bool, 2025-05-07T20:32:51.4487959Z compiled: bool, 2025-05-07T20:32:51.4488036Z ) -> None: 2025-05-07T20:32:51.4488130Z torch.manual_seed(2025) 2025-05-07T20:32:51.4488202Z 2025-05-07T20:32:51.4488421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4490197Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4490208Z 2025-05-07T20:32:51.4490328Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4490332Z 2025-05-07T20:32:51.4490433Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4490654Z self=, 2025-05-07T20:32:51.4490734Z T=4096, 2025-05-07T20:32:51.4490811Z D=7168, 2025-05-07T20:32:51.4490892Z scale_ub=None, 2025-05-07T20:32:51.4491020Z contiguous=True, 2025-05-07T20:32:51.4491103Z compiled=True, 2025-05-07T20:32:51.4491175Z ) 2025-05-07T20:32:51.4491393Z self = 2025-05-07T20:32:51.4491603Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.4491607Z 2025-05-07T20:32:51.4491687Z @given( 2025-05-07T20:32:51.4491802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4491900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4492016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4492132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4492244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4492321Z ) 2025-05-07T20:32:51.4492567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4492659Z def test_silu_mul_quant( 2025-05-07T20:32:51.4492742Z self, 2025-05-07T20:32:51.4492818Z T: int, 2025-05-07T20:32:51.4492896Z D: int, 2025-05-07T20:32:51.4492992Z scale_ub: Optional[float], 2025-05-07T20:32:51.4493143Z contiguous: bool, 2025-05-07T20:32:51.4493233Z compiled: bool, 2025-05-07T20:32:51.4493310Z ) -> None: 2025-05-07T20:32:51.4493402Z torch.manual_seed(2025) 2025-05-07T20:32:51.4493477Z 2025-05-07T20:32:51.4493643Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4495421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4495427Z 2025-05-07T20:32:51.4495546Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4495558Z 2025-05-07T20:32:51.4495658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4495886Z self=, 2025-05-07T20:32:51.4495982Z T=2048, 2025-05-07T20:32:51.4496063Z D=5120, 2025-05-07T20:32:51.4496143Z scale_ub=1200.0, 2025-05-07T20:32:51.4496225Z contiguous=False, 2025-05-07T20:32:51.4496309Z compiled=False, 2025-05-07T20:32:51.4496380Z ) 2025-05-07T20:32:51.4496594Z self = 2025-05-07T20:32:51.4496769Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4496773Z 2025-05-07T20:32:51.4496894Z @given( 2025-05-07T20:32:51.4497012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4497113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4497225Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4497347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4497461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4497536Z ) 2025-05-07T20:32:51.4497781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4497872Z def test_silu_mul_quant( 2025-05-07T20:32:51.4497946Z self, 2025-05-07T20:32:51.4498024Z T: int, 2025-05-07T20:32:51.4498099Z D: int, 2025-05-07T20:32:51.4498195Z scale_ub: Optional[float], 2025-05-07T20:32:51.4498285Z contiguous: bool, 2025-05-07T20:32:51.4498368Z compiled: bool, 2025-05-07T20:32:51.4498444Z ) -> None: 2025-05-07T20:32:51.4498542Z torch.manual_seed(2025) 2025-05-07T20:32:51.4498613Z 2025-05-07T20:32:51.4498826Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4500581Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4500629Z 2025-05-07T20:32:51.4500747Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4500752Z 2025-05-07T20:32:51.4500851Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4501072Z self=, 2025-05-07T20:32:51.4501151Z T=4096, 2025-05-07T20:32:51.4501228Z D=7168, 2025-05-07T20:32:51.4501308Z scale_ub=1200.0, 2025-05-07T20:32:51.4501395Z contiguous=True, 2025-05-07T20:32:51.4501479Z compiled=False, 2025-05-07T20:32:51.4501589Z ) 2025-05-07T20:32:51.4501805Z self = 2025-05-07T20:32:51.4501978Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4501983Z 2025-05-07T20:32:51.4502062Z @given( 2025-05-07T20:32:51.4502177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4502273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4502387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4502505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4502615Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4502694Z ) 2025-05-07T20:32:51.4502939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4503032Z def test_silu_mul_quant( 2025-05-07T20:32:51.4503109Z self, 2025-05-07T20:32:51.4503185Z T: int, 2025-05-07T20:32:51.4503264Z D: int, 2025-05-07T20:32:51.4503367Z scale_ub: Optional[float], 2025-05-07T20:32:51.4503455Z contiguous: bool, 2025-05-07T20:32:51.4503541Z compiled: bool, 2025-05-07T20:32:51.4503617Z ) -> None: 2025-05-07T20:32:51.4503710Z torch.manual_seed(2025) 2025-05-07T20:32:51.4503787Z 2025-05-07T20:32:51.4503951Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4505759Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4505768Z 2025-05-07T20:32:51.4505889Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4505894Z 2025-05-07T20:32:51.4505993Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4506214Z self=, 2025-05-07T20:32:51.4506290Z T=16384, 2025-05-07T20:32:51.4506373Z D=7168, 2025-05-07T20:32:51.4506453Z scale_ub=None, 2025-05-07T20:32:51.4506539Z contiguous=False, 2025-05-07T20:32:51.4506622Z compiled=True, 2025-05-07T20:32:51.4506695Z ) 2025-05-07T20:32:51.4506909Z self = 2025-05-07T20:32:51.4507088Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.4507093Z 2025-05-07T20:32:51.4507170Z @given( 2025-05-07T20:32:51.4507324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4507424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4507540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4507695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4507807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4507879Z ) 2025-05-07T20:32:51.4508125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4508218Z def test_silu_mul_quant( 2025-05-07T20:32:51.4508870Z self, 2025-05-07T20:32:51.4508969Z T: int, 2025-05-07T20:32:51.4509044Z D: int, 2025-05-07T20:32:51.4509142Z scale_ub: Optional[float], 2025-05-07T20:32:51.4509231Z contiguous: bool, 2025-05-07T20:32:51.4509318Z compiled: bool, 2025-05-07T20:32:51.4509402Z ) -> None: 2025-05-07T20:32:51.4509498Z torch.manual_seed(2025) 2025-05-07T20:32:51.4509573Z 2025-05-07T20:32:51.4509742Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4511505Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4511601Z 2025-05-07T20:32:51.4511720Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4511725Z 2025-05-07T20:32:51.4511828Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4512048Z self=, 2025-05-07T20:32:51.4512128Z T=4096, 2025-05-07T20:32:51.4512202Z D=7168, 2025-05-07T20:32:51.4512282Z scale_ub=None, 2025-05-07T20:32:51.4512367Z contiguous=True, 2025-05-07T20:32:51.4512453Z compiled=False, 2025-05-07T20:32:51.4512528Z ) 2025-05-07T20:32:51.4512743Z self = 2025-05-07T20:32:51.4512909Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4512914Z 2025-05-07T20:32:51.4512993Z @given( 2025-05-07T20:32:51.4513107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4513203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4513319Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4513433Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4513608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4513684Z ) 2025-05-07T20:32:51.4513928Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4514022Z def test_silu_mul_quant( 2025-05-07T20:32:51.4514099Z self, 2025-05-07T20:32:51.4514178Z T: int, 2025-05-07T20:32:51.4514257Z D: int, 2025-05-07T20:32:51.4514354Z scale_ub: Optional[float], 2025-05-07T20:32:51.4514441Z contiguous: bool, 2025-05-07T20:32:51.4514527Z compiled: bool, 2025-05-07T20:32:51.4514604Z ) -> None: 2025-05-07T20:32:51.4514696Z torch.manual_seed(2025) 2025-05-07T20:32:51.4514772Z 2025-05-07T20:32:51.4514936Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4516767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4516824Z 2025-05-07T20:32:51.4516942Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4516946Z 2025-05-07T20:32:51.4517047Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4517267Z self=, 2025-05-07T20:32:51.4517343Z T=16384, 2025-05-07T20:32:51.4517424Z D=7168, 2025-05-07T20:32:51.4517504Z scale_ub=None, 2025-05-07T20:32:51.4517589Z contiguous=True, 2025-05-07T20:32:51.4517679Z compiled=False, 2025-05-07T20:32:51.4517751Z ) 2025-05-07T20:32:51.4517966Z self = 2025-05-07T20:32:51.4518145Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4518150Z 2025-05-07T20:32:51.4518227Z @given( 2025-05-07T20:32:51.4518343Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4518499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4518613Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4518729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4518839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4518911Z ) 2025-05-07T20:32:51.4519157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4519247Z def test_silu_mul_quant( 2025-05-07T20:32:51.4519323Z self, 2025-05-07T20:32:51.4519401Z T: int, 2025-05-07T20:32:51.4519477Z D: int, 2025-05-07T20:32:51.4519575Z scale_ub: Optional[float], 2025-05-07T20:32:51.4519673Z contiguous: bool, 2025-05-07T20:32:51.4519757Z compiled: bool, 2025-05-07T20:32:51.4519836Z ) -> None: 2025-05-07T20:32:51.4519931Z torch.manual_seed(2025) 2025-05-07T20:32:51.4520004Z 2025-05-07T20:32:51.4520173Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4521944Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4521950Z 2025-05-07T20:32:51.4522071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4522124Z 2025-05-07T20:32:51.4522225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4522447Z self=, 2025-05-07T20:32:51.4522529Z T=16384, 2025-05-07T20:32:51.4522605Z D=7168, 2025-05-07T20:32:51.4522688Z scale_ub=1200.0, 2025-05-07T20:32:51.4522773Z contiguous=True, 2025-05-07T20:32:51.4522854Z compiled=False, 2025-05-07T20:32:51.4522926Z ) 2025-05-07T20:32:51.4523141Z self = 2025-05-07T20:32:51.4523315Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4523320Z 2025-05-07T20:32:51.4523399Z @given( 2025-05-07T20:32:51.4523515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4523611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4523728Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4523845Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4523997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4524077Z ) 2025-05-07T20:32:51.4524422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4524520Z def test_silu_mul_quant( 2025-05-07T20:32:51.4524660Z self, 2025-05-07T20:32:51.4524736Z T: int, 2025-05-07T20:32:51.4524813Z D: int, 2025-05-07T20:32:51.4524911Z scale_ub: Optional[float], 2025-05-07T20:32:51.4524999Z contiguous: bool, 2025-05-07T20:32:51.4525085Z compiled: bool, 2025-05-07T20:32:51.4525164Z ) -> None: 2025-05-07T20:32:51.4525256Z torch.manual_seed(2025) 2025-05-07T20:32:51.4525334Z 2025-05-07T20:32:51.4525501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4527272Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
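Because Hypothesis replays all of these examples inside one process, memory held by earlier examples is never returned, and every subsequent example now dies on its very first allocation. A sketch of a defensive cleanup between examples; the hook itself is a hypothetical addition, not present in activation_test.py:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return cached blocks to the
        # driver; empty_cache alone cannot free memory still held by tensors.
        gc.collect()
        torch.cuda.empty_cache()

Calling this from the test case's tearDown (or between Hypothesis examples) would stop one failing example from starving the rest, provided nothing keeps live references to the large tensors.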
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4527320Z 2025-05-07T20:32:51.4527437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4527441Z 2025-05-07T20:32:51.4527541Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4527760Z self=, 2025-05-07T20:32:51.4527840Z T=128, 2025-05-07T20:32:51.4527918Z D=5120, 2025-05-07T20:32:51.4528001Z scale_ub=1200.0, 2025-05-07T20:32:51.4528084Z contiguous=False, 2025-05-07T20:32:51.4528171Z compiled=False, 2025-05-07T20:32:51.4528242Z ) 2025-05-07T20:32:51.4528458Z self = 2025-05-07T20:32:51.4528633Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.4528638Z 2025-05-07T20:32:51.4528714Z @given( 2025-05-07T20:32:51.4528832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4528936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4529047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4529164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4529274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4529347Z ) 2025-05-07T20:32:51.4529588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4529678Z def test_silu_mul_quant( 2025-05-07T20:32:51.4529752Z self, 2025-05-07T20:32:51.4529831Z T: int, 2025-05-07T20:32:51.4529906Z D: int, 2025-05-07T20:32:51.4530047Z scale_ub: Optional[float], 2025-05-07T20:32:51.4530139Z contiguous: bool, 2025-05-07T20:32:51.4530225Z compiled: bool, 2025-05-07T20:32:51.4530300Z ) -> None: 2025-05-07T20:32:51.4530393Z torch.manual_seed(2025) 2025-05-07T20:32:51.4530467Z 2025-05-07T20:32:51.4530640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4533474Z 2025-05-07T20:32:51.4533571Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4533700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4533786Z x = x_sign * x_clamp 2025-05-07T20:32:51.4533866Z x0 = x[:, :D] 2025-05-07T20:32:51.4533946Z x1 = x[:, D:] 2025-05-07T20:32:51.4534014Z 2025-05-07T20:32:51.4534095Z if contiguous: 2025-05-07T20:32:51.4534187Z x0 = x0.contiguous() 2025-05-07T20:32:51.4534277Z x1 = x1.contiguous() 2025-05-07T20:32:51.4534347Z 2025-05-07T20:32:51.4534443Z if scale_ub is not None: 2025-05-07T20:32:51.4534545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4534740Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4534820Z ) 2025-05-07T20:32:51.4534899Z else: 2025-05-07T20:32:51.4534996Z scale_ub_tensor = None 2025-05-07T20:32:51.4535114Z 2025-05-07T20:32:51.4535243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4535332Z op = silu_mul_quant 2025-05-07T20:32:51.4535414Z if compiled: 2025-05-07T20:32:51.4535511Z op = torch.compile(op) 2025-05-07T20:32:51.4535618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4535689Z 2025-05-07T20:32:51.4535777Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4535782Z 2025-05-07T20:32:51.4535881Z moe/activation_test.py:117: 2025-05-07T20:32:51.4536007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4536107Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4536209Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4536711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4536850Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4537206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4537422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4537763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4537855Z kernel = self.compile( 2025-05-07T20:32:51.4538238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4538412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4538537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4538541Z 2025-05-07T20:32:51.4538747Z self = 2025-05-07T20:32:51.4539519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4540023Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f92ce0ae0>} 2025-05-07T20:32:51.4540761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4540990Z context = 2025-05-07T20:32:51.4540995Z 2025-05-07T20:32:51.4541160Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4541416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4541525Z module_map=module_map) 2025-05-07T20:32:51.4541685Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4541779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4541859Z E ^ 2025-05-07T20:32:51.4542211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4542215Z 2025-05-07T20:32:51.4542622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4542629Z 2025-05-07T20:32:51.4542728Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4542947Z self=, 2025-05-07T20:32:51.4543064Z T=2048, 2025-05-07T20:32:51.4543137Z D=7168, 2025-05-07T20:32:51.4543215Z scale_ub=None, 2025-05-07T20:32:51.4543302Z contiguous=False, 2025-05-07T20:32:51.4543385Z compiled=False, 2025-05-07T20:32:51.4543493Z ) 2025-05-07T20:32:51.4543711Z self = 2025-05-07T20:32:51.4543879Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.4543884Z 2025-05-07T20:32:51.4543959Z @given( 2025-05-07T20:32:51.4544074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4544169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4544285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4544397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4544509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4544589Z ) 2025-05-07T20:32:51.4544832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4544921Z def test_silu_mul_quant( 2025-05-07T20:32:51.4544999Z self, 2025-05-07T20:32:51.4545073Z T: int, 2025-05-07T20:32:51.4545189Z D: int, 2025-05-07T20:32:51.4545290Z scale_ub: Optional[float], 2025-05-07T20:32:51.4545377Z contiguous: bool, 2025-05-07T20:32:51.4545463Z compiled: bool, 2025-05-07T20:32:51.4545539Z ) -> None: 2025-05-07T20:32:51.4545631Z torch.manual_seed(2025) 2025-05-07T20:32:51.4545706Z 2025-05-07T20:32:51.4545897Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4547698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4547712Z 2025-05-07T20:32:51.4547826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4547830Z 2025-05-07T20:32:51.4547930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4548148Z self=, 2025-05-07T20:32:51.4548221Z T=128, 2025-05-07T20:32:51.4548293Z D=7168, 2025-05-07T20:32:51.4548373Z scale_ub=1200.0, 2025-05-07T20:32:51.4548453Z contiguous=True, 2025-05-07T20:32:51.4548538Z compiled=True, 2025-05-07T20:32:51.4548609Z ) 2025-05-07T20:32:51.4548822Z self = 2025-05-07T20:32:51.4549077Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4549082Z 2025-05-07T20:32:51.4549159Z @given( 2025-05-07T20:32:51.4549274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4549378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4549494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4549612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4549725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4549798Z ) 2025-05-07T20:32:51.4550041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4550131Z def test_silu_mul_quant( 2025-05-07T20:32:51.4550205Z self, 2025-05-07T20:32:51.4550279Z T: int, 2025-05-07T20:32:51.4550349Z D: int, 2025-05-07T20:32:51.4550442Z scale_ub: Optional[float], 2025-05-07T20:32:51.4550529Z contiguous: bool, 2025-05-07T20:32:51.4550612Z compiled: bool, 2025-05-07T20:32:51.4550686Z ) -> None: 2025-05-07T20:32:51.4550819Z torch.manual_seed(2025) 2025-05-07T20:32:51.4550889Z 2025-05-07T20:32:51.4551051Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4551126Z 2025-05-07T20:32:51.4551215Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4551382Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4551465Z x = x_sign * x_clamp 2025-05-07T20:32:51.4551540Z x0 = x[:, :D] 2025-05-07T20:32:51.4551618Z x1 = x[:, D:] 2025-05-07T20:32:51.4551687Z 2025-05-07T20:32:51.4551766Z if contiguous: 2025-05-07T20:32:51.4551854Z x0 = x0.contiguous() 2025-05-07T20:32:51.4551938Z x1 = x1.contiguous() 2025-05-07T20:32:51.4552007Z 2025-05-07T20:32:51.4552100Z if scale_ub is not None: 2025-05-07T20:32:51.4552201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4552335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4552413Z ) 2025-05-07T20:32:51.4552490Z else: 2025-05-07T20:32:51.4552581Z scale_ub_tensor = None 2025-05-07T20:32:51.4552652Z 2025-05-07T20:32:51.4552779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4552908Z op = silu_mul_quant 2025-05-07T20:32:51.4552987Z if compiled: 2025-05-07T20:32:51.4553083Z op = torch.compile(op) 2025-05-07T20:32:51.4553184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4553250Z 2025-05-07T20:32:51.4553342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4553347Z 2025-05-07T20:32:51.4553439Z moe/activation_test.py:117: 2025-05-07T20:32:51.4553562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4553664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4553759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4554127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.4554215Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.4554702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4554803Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4555155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4555374Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4555715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4555808Z kernel = self.compile( 2025-05-07T20:32:51.4556240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4556453Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4556578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4556582Z 2025-05-07T20:32:51.4556786Z self = 2025-05-07T20:32:51.4557564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4558061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9f928409a0>} 2025-05-07T20:32:51.4558802Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4559049Z context = 2025-05-07T20:32:51.4559059Z 2025-05-07T20:32:51.4559220Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4559477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4559625Z module_map=module_map) 2025-05-07T20:32:51.4559783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4559874Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4559952Z E ^ 2025-05-07T20:32:51.4560302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4560307Z 2025-05-07T20:32:51.4560718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4560722Z 2025-05-07T20:32:51.4560823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4561041Z self=, 2025-05-07T20:32:51.4561116Z T=128, 2025-05-07T20:32:51.4561189Z D=7168, 2025-05-07T20:32:51.4561266Z scale_ub=1200.0, 2025-05-07T20:32:51.4561397Z contiguous=True, 2025-05-07T20:32:51.4561477Z compiled=False, 2025-05-07T20:32:51.4561549Z ) 2025-05-07T20:32:51.4561766Z self = 2025-05-07T20:32:51.4561931Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.4561936Z 2025-05-07T20:32:51.4562011Z @given( 2025-05-07T20:32:51.4562124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4562220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4562331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4562444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4562554Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4562629Z ) 2025-05-07T20:32:51.4562871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4562963Z def test_silu_mul_quant( 2025-05-07T20:32:51.4563040Z self, 2025-05-07T20:32:51.4563117Z T: int, 2025-05-07T20:32:51.4563191Z D: int, 2025-05-07T20:32:51.4563285Z scale_ub: Optional[float], 2025-05-07T20:32:51.4563369Z contiguous: bool, 2025-05-07T20:32:51.4563450Z compiled: bool, 2025-05-07T20:32:51.4563523Z ) -> None: 2025-05-07T20:32:51.4563614Z torch.manual_seed(2025) 2025-05-07T20:32:51.4563686Z 2025-05-07T20:32:51.4563846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4563919Z 2025-05-07T20:32:51.4564008Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4564131Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4566095Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
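Note the progression: earlier examples failed in torch.randn (activation_test.py:92); now even the 20 MiB intermediate from torch.clamp (line 95) fails, with free memory down from 26.44 MiB to 4.44 MiB. The error text itself suggests the other mitigation, expandable segments. A sketch of applying it in-process; the key point is that PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator is initialized:

    import os
    # Must happen before the first CUDA allocation in the process; with
    # expandable_segments the allocator grows existing segments instead of
    # fragmenting fixed-size blocks.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch

Equivalently, export the variable in the job step before invoking pytest.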
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4566108Z 2025-05-07T20:32:51.4566223Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.4566227Z 2025-05-07T20:32:51.4566327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4566542Z self=, 2025-05-07T20:32:51.4566613Z T=128, 2025-05-07T20:32:51.4566689Z D=5120, 2025-05-07T20:32:51.4566766Z scale_ub=1200.0, 2025-05-07T20:32:51.4566847Z contiguous=True, 2025-05-07T20:32:51.4566936Z compiled=True, 2025-05-07T20:32:51.4567047Z ) 2025-05-07T20:32:51.4567258Z self = 2025-05-07T20:32:51.4567426Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.4567472Z 2025-05-07T20:32:51.4567546Z @given( 2025-05-07T20:32:51.4567664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4567756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4567865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4567978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4568084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4568152Z ) 2025-05-07T20:32:51.4568394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4568482Z def test_silu_mul_quant( 2025-05-07T20:32:51.4568558Z self, 2025-05-07T20:32:51.4568635Z T: int, 2025-05-07T20:32:51.4568707Z D: int, 2025-05-07T20:32:51.4568805Z scale_ub: Optional[float], 2025-05-07T20:32:51.4568891Z contiguous: bool, 2025-05-07T20:32:51.4568971Z compiled: bool, 2025-05-07T20:32:51.4569092Z ) -> None: 2025-05-07T20:32:51.4569182Z torch.manual_seed(2025) 2025-05-07T20:32:51.4569252Z 2025-05-07T20:32:51.4569417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4569489Z 2025-05-07T20:32:51.4569574Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4569696Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4571448Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4571457Z 2025-05-07T20:32:51.4571572Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.4571578Z 2025-05-07T20:32:51.4571674Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4571892Z self=, 2025-05-07T20:32:51.4571966Z T=128, 2025-05-07T20:32:51.4572039Z D=7168, 2025-05-07T20:32:51.4572119Z scale_ub=None, 2025-05-07T20:32:51.4572198Z contiguous=True, 2025-05-07T20:32:51.4572275Z compiled=True, 2025-05-07T20:32:51.4572347Z ) 2025-05-07T20:32:51.4572557Z self = 2025-05-07T20:32:51.4572760Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.4572765Z 2025-05-07T20:32:51.4572842Z @given( 2025-05-07T20:32:51.4572957Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4573050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4573162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4573277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4573387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4573456Z ) 2025-05-07T20:32:51.4573696Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4573785Z def test_silu_mul_quant( 2025-05-07T20:32:51.4573857Z self, 2025-05-07T20:32:51.4573928Z T: int, 2025-05-07T20:32:51.4574003Z D: int, 2025-05-07T20:32:51.4574093Z scale_ub: Optional[float], 2025-05-07T20:32:51.4574178Z contiguous: bool, 2025-05-07T20:32:51.4574262Z compiled: bool, 2025-05-07T20:32:51.4574339Z ) -> None: 2025-05-07T20:32:51.4574429Z torch.manual_seed(2025) 2025-05-07T20:32:51.4574506Z 2025-05-07T20:32:51.4574711Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4576466Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.4576525Z 2025-05-07T20:32:51.4576641Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.4576774Z =============================== warnings summary =============================== 2025-05-07T20:32:51.4577080Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4577375Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4577715Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:51.4578582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:51.4578810Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:51.4578814Z 2025-05-07T20:32:51.4579021Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:51.4579188Z ================= 1 failed, 1 deselected, 3 warnings in 14.25s ================= 2025-05-07T20:32:53.4337485Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:53.5146919Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:53.5147632Z 2025-05-07T20:32:55.5164047Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:57.7250037Z ============================= test session starts ============================== 2025-05-07T20:32:57.7250750Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:57.7251278Z cachedir: .pytest_cache 2025-05-07T20:32:57.7252062Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:57.7252798Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:57.7253195Z plugins: hypothesis-6.131.14 2025-05-07T20:32:59.3219857Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:59.4196250Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:59.4196685Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:59.4196904Z 2025-05-07T20:33:01.6397866Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6398596Z self=, 2025-05-07T20:33:01.6399006Z T=1, 2025-05-07T20:33:01.6399193Z D=5120, 2025-05-07T20:33:01.6399379Z scale_ub=None, 2025-05-07T20:33:01.6399588Z contiguous=True, 2025-05-07T20:33:01.6399811Z compiled=True, 2025-05-07T20:33:01.6400011Z ) 2025-05-07T20:33:01.6400355Z self = 2025-05-07T20:33:01.6401190Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.6401452Z 2025-05-07T20:33:01.6401545Z @given( 2025-05-07T20:33:01.6401774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.6402171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.6402471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.6402785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.6403108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.6403382Z ) 2025-05-07T20:33:01.6403714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.6404150Z def test_silu_mul_quant( 2025-05-07T20:33:01.6404525Z self, 2025-05-07T20:33:01.6404708Z T: int, 2025-05-07T20:33:01.6404904Z D: int, 2025-05-07T20:33:01.6405124Z scale_ub: Optional[float], 2025-05-07T20:33:01.6405406Z contiguous: bool, 2025-05-07T20:33:01.6405648Z compiled: bool, 2025-05-07T20:33:01.6405885Z ) -> None: 2025-05-07T20:33:01.6406113Z torch.manual_seed(2025) 2025-05-07T20:33:01.6406346Z 2025-05-07T20:33:01.6406739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.6407095Z 2025-05-07T20:33:01.6407288Z x_sign = torch.sign(x) 2025-05-07T20:33:01.6407582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:01.6407898Z x = x_sign * x_clamp 2025-05-07T20:33:01.6408134Z x0 = x[:, :D] 2025-05-07T20:33:01.6408582Z x1 = x[:, D:] 2025-05-07T20:33:01.6408792Z 2025-05-07T20:33:01.6408975Z if contiguous: 2025-05-07T20:33:01.6409212Z x0 = x0.contiguous() 2025-05-07T20:33:01.6409478Z x1 = x1.contiguous() 2025-05-07T20:33:01.6409712Z 2025-05-07T20:33:01.6409910Z if scale_ub is not None: 2025-05-07T20:33:01.6410193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.6410531Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.6410850Z ) 2025-05-07T20:33:01.6411054Z else: 2025-05-07T20:33:01.6411275Z scale_ub_tensor = None 2025-05-07T20:33:01.6411526Z 2025-05-07T20:33:01.6411764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6412077Z op = silu_mul_quant 2025-05-07T20:33:01.6412319Z if compiled: 2025-05-07T20:33:01.6412569Z op = torch.compile(op) 2025-05-07T20:33:01.6412868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6413129Z 2025-05-07T20:33:01.6413321Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.6413600Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.6413876Z 2025-05-07T20:33:01.6414110Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6414539Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.6414828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.6415136Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.6415485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.6415790Z 2025-05-07T20:33:01.6415980Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.6416175Z 2025-05-07T20:33:01.6416270Z moe/activation_test.py:126: 2025-05-07T20:33:01.6416564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6416888Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.6417206Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.6417989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.6418732Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.6419337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6420018Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6420702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.6421471Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.6422192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.6422822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.6423413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.6423912Z fn() 2025-05-07T20:33:01.6424412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.6424983Z self.fn.run( 2025-05-07T20:33:01.6425443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6425955Z kernel = self.compile( 2025-05-07T20:33:01.6426565Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6427215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6427600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6427834Z 2025-05-07T20:33:01.6428036Z self = 2025-05-07T20:33:01.6429119Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6430499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d32be700>} 2025-05-07T20:33:01.6431827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6432833Z context = 2025-05-07T20:33:01.6433124Z 2025-05-07T20:33:01.6433288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6433808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6434267Z module_map=module_map) 2025-05-07T20:33:01.6434616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6435007Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.6435265Z E ^ 2025-05-07T20:33:01.6435712Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6436161Z 2025-05-07T20:33:01.6436568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.6437084Z 2025-05-07T20:33:01.6437184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.6437589Z self=, 2025-05-07T20:33:01.6437972Z T=2048, 2025-05-07T20:33:01.6438156Z D=5120, 2025-05-07T20:33:01.6438341Z scale_ub=1200.0, 2025-05-07T20:33:01.6438548Z contiguous=True, 2025-05-07T20:33:01.6438765Z compiled=False, 2025-05-07T20:33:01.6438965Z ) 2025-05-07T20:33:01.6439267Z self = 2025-05-07T20:33:01.6439760Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.6440025Z 2025-05-07T20:33:01.6440156Z @given( 2025-05-07T20:33:01.6440373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.6440677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.6440978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.6441344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.6441657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.6441935Z ) 2025-05-07T20:33:01.6442273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.6442701Z def test_silu_mul_quant( 2025-05-07T20:33:01.6442932Z self, 2025-05-07T20:33:01.6443117Z T: int, 2025-05-07T20:33:01.6443299Z D: int, 2025-05-07T20:33:01.6443508Z scale_ub: Optional[float], 2025-05-07T20:33:01.6443772Z contiguous: bool, 2025-05-07T20:33:01.6443996Z compiled: bool, 2025-05-07T20:33:01.6444213Z ) -> None: 2025-05-07T20:33:01.6444523Z torch.manual_seed(2025) 2025-05-07T20:33:01.6444754Z 2025-05-07T20:33:01.6445023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.6445368Z 2025-05-07T20:33:01.6445618Z x_sign = torch.sign(x) 2025-05-07T20:33:01.6445917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.6446232Z x = x_sign * x_clamp 2025-05-07T20:33:01.6446472Z x0 = x[:, :D] 
2025-05-07T20:33:01.6446676Z x1 = x[:, D:] 2025-05-07T20:33:01.6446893Z 2025-05-07T20:33:01.6447081Z if contiguous: 2025-05-07T20:33:01.6447311Z x0 = x0.contiguous() 2025-05-07T20:33:01.6447575Z x1 = x1.contiguous() 2025-05-07T20:33:01.6447820Z 2025-05-07T20:33:01.6448000Z if scale_ub is not None: 2025-05-07T20:33:01.6448278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.6448619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.6448925Z ) 2025-05-07T20:33:01.6449125Z else: 2025-05-07T20:33:01.6449339Z scale_ub_tensor = None 2025-05-07T20:33:01.6449582Z 2025-05-07T20:33:01.6449822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.6450139Z op = silu_mul_quant 2025-05-07T20:33:01.6450387Z if compiled: 2025-05-07T20:33:01.6450642Z op = torch.compile(op) 2025-05-07T20:33:01.6450937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6451206Z 2025-05-07T20:33:01.6451387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.6451561Z 2025-05-07T20:33:01.6451659Z moe/activation_test.py:117: 2025-05-07T20:33:01.6451995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6452329Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.6452612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.6453393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.6454087Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.6454614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.6455299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.6455948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.6456459Z kernel = self.compile( 2025-05-07T20:33:01.6456994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.6457644Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.6458036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.6458263Z 2025-05-07T20:33:01.6458468Z self = 2025-05-07T20:33:01.6459614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.6461017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d316e020>} 2025-05-07T20:33:01.6462360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.6463369Z context = 2025-05-07T20:33:01.6463663Z 2025-05-07T20:33:01.6463833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.6464361Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.6464829Z module_map=module_map) 2025-05-07T20:33:01.6465190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.6465595Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.6465861Z E ^ 2025-05-07T20:33:01.6466321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.6466783Z 2025-05-07T20:33:01.6467197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3156542Z 2025-05-07T20:33:02.3163817Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3164446Z self=, 2025-05-07T20:33:02.3164930Z T=2048, 2025-05-07T20:33:02.3165127Z D=5120, 2025-05-07T20:33:02.3165354Z scale_ub=1200.0, 2025-05-07T20:33:02.3165580Z contiguous=True, 2025-05-07T20:33:02.3165830Z compiled=True, 2025-05-07T20:33:02.3166049Z ) 2025-05-07T20:33:02.3166370Z self = 2025-05-07T20:33:02.3166887Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:02.3167165Z 2025-05-07T20:33:02.3167254Z @given( 2025-05-07T20:33:02.3167490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3167809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3168125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3168454Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3168793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3169086Z ) 2025-05-07T20:33:02.3169443Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3170192Z def test_silu_mul_quant( 2025-05-07T20:33:02.3170456Z self, 2025-05-07T20:33:02.3170670Z T: int, 2025-05-07T20:33:02.3170872Z D: int, 2025-05-07T20:33:02.3171101Z scale_ub: Optional[float], 2025-05-07T20:33:02.3171382Z contiguous: bool, 2025-05-07T20:33:02.3171621Z compiled: bool, 2025-05-07T20:33:02.3171876Z ) -> None: 2025-05-07T20:33:02.3172106Z torch.manual_seed(2025) 2025-05-07T20:33:02.3172348Z 2025-05-07T20:33:02.3172634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3172986Z 2025-05-07T20:33:02.3173182Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3173482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3173803Z x = x_sign * x_clamp 2025-05-07T20:33:02.3174052Z x0 = x[:, :D] 2025-05-07T20:33:02.3174267Z x1 = x[:, D:] 2025-05-07T20:33:02.3174480Z 2025-05-07T20:33:02.3174672Z if contiguous: 2025-05-07T20:33:02.3174905Z x0 = x0.contiguous() 2025-05-07T20:33:02.3175172Z x1 = x1.contiguous() 2025-05-07T20:33:02.3175514Z 2025-05-07T20:33:02.3175707Z if scale_ub is not None: 2025-05-07T20:33:02.3175990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3176333Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3176716Z ) 2025-05-07T20:33:02.3176922Z else: 2025-05-07T20:33:02.3177142Z scale_ub_tensor = None 2025-05-07T20:33:02.3177390Z 2025-05-07T20:33:02.3177638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3177962Z op = silu_mul_quant 2025-05-07T20:33:02.3178212Z if compiled: 2025-05-07T20:33:02.3178474Z op = torch.compile(op) 2025-05-07T20:33:02.3178776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3179050Z 2025-05-07T20:33:02.3179253Z y_fp8, y_scale = fn() 2025-05-07T20:33:02.3179551Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:02.3179844Z 2025-05-07T20:33:02.3180082Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3180427Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:02.3180731Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:02.3181138Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:02.3181507Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.3181820Z 2025-05-07T20:33:02.3182023Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:02.3182224Z 2025-05-07T20:33:02.3182323Z moe/activation_test.py:126: 2025-05-07T20:33:02.3182631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3182961Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:02.3183291Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:02.3184084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:02.3184826Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:02.3185376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3186069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3186754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:02.3187473Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:02.3188201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:02.3188839Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:02.3189490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:02.3190008Z fn() 2025-05-07T20:33:02.3190568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:02.3191157Z self.fn.run( 2025-05-07T20:33:02.3191632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3192172Z kernel = self.compile( 2025-05-07T20:33:02.3192720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3193373Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3193765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3194009Z 2025-05-07T20:33:02.3194218Z self = 2025-05-07T20:33:02.3195360Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3196743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d215f100>} 2025-05-07T20:33:02.3198119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3199141Z context = 2025-05-07T20:33:02.3199441Z 2025-05-07T20:33:02.3199610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3200130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3200609Z module_map=module_map) 2025-05-07T20:33:02.3200996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3201369Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:02.3201648Z E ^ 2025-05-07T20:33:02.3202187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3202653Z 2025-05-07T20:33:02.3203072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3203583Z 2025-05-07T20:33:02.3203709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3204121Z self=, 2025-05-07T20:33:02.3204620Z T=16384, 2025-05-07T20:33:02.3204835Z D=7168, 2025-05-07T20:33:02.3205038Z scale_ub=1200.0, 2025-05-07T20:33:02.3205278Z contiguous=False, 2025-05-07T20:33:02.3205521Z compiled=False, 2025-05-07T20:33:02.3205752Z ) 2025-05-07T20:33:02.3206077Z self = 2025-05-07T20:33:02.3206589Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.3206875Z 2025-05-07T20:33:02.3206973Z @given( 2025-05-07T20:33:02.3207215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3207555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3207880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3208490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3208843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3209144Z ) 2025-05-07T20:33:02.3209515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3209961Z def test_silu_mul_quant( 2025-05-07T20:33:02.3210219Z self, 2025-05-07T20:33:02.3210435Z T: int, 2025-05-07T20:33:02.3210739Z D: int, 2025-05-07T20:33:02.3210977Z scale_ub: Optional[float], 2025-05-07T20:33:02.3211265Z contiguous: bool, 2025-05-07T20:33:02.3211512Z compiled: bool, 2025-05-07T20:33:02.3211755Z ) -> None: 2025-05-07T20:33:02.3211989Z torch.manual_seed(2025) 2025-05-07T20:33:02.3212246Z 2025-05-07T20:33:02.3212539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3212898Z 2025-05-07T20:33:02.3213102Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3213413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3213734Z x = x_sign * x_clamp 2025-05-07T20:33:02.3213975Z x0 = x[:, :D] 2025-05-07T20:33:02.3214207Z x1 = x[:, D:] 2025-05-07T20:33:02.3214432Z 2025-05-07T20:33:02.3214630Z if contiguous: 2025-05-07T20:33:02.3214874Z x0 = x0.contiguous() 2025-05-07T20:33:02.3215152Z x1 = x1.contiguous() 2025-05-07T20:33:02.3215405Z 2025-05-07T20:33:02.3215603Z if scale_ub is not None: 2025-05-07T20:33:02.3215956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3216300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3216608Z ) 2025-05-07T20:33:02.3216820Z else: 2025-05-07T20:33:02.3217102Z scale_ub_tensor = None 2025-05-07T20:33:02.3217353Z 2025-05-07T20:33:02.3217597Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3217915Z op = silu_mul_quant 2025-05-07T20:33:02.3218163Z if compiled: 2025-05-07T20:33:02.3218417Z op = torch.compile(op) 2025-05-07T20:33:02.3218711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3218984Z 2025-05-07T20:33:02.3219179Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3219340Z 2025-05-07T20:33:02.3219446Z moe/activation_test.py:117: 2025-05-07T20:33:02.3219746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3220076Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3220369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3221057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.3221844Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3222382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3223061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3223721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3224245Z kernel = self.compile( 2025-05-07T20:33:02.3224779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3225436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3225833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3226068Z 2025-05-07T20:33:02.3226271Z self = 2025-05-07T20:33:02.3227353Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3228716Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e14a40>} 2025-05-07T20:33:02.3230057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3231120Z context = 2025-05-07T20:33:02.3231419Z 2025-05-07T20:33:02.3231587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3232106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3232578Z module_map=module_map) 2025-05-07T20:33:02.3232936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3233291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3233550Z E ^ 2025-05-07T20:33:02.3234007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3234463Z 2025-05-07T20:33:02.3234878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0712884Z 2025-05-07T20:33:03.0713114Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0713730Z self=, 2025-05-07T20:33:03.0714236Z T=1, 2025-05-07T20:33:03.0714453Z D=7168, 2025-05-07T20:33:03.0714661Z scale_ub=None, 2025-05-07T20:33:03.0714882Z contiguous=True, 2025-05-07T20:33:03.0715132Z compiled=True, 2025-05-07T20:33:03.0716925Z ) 2025-05-07T20:33:03.0717263Z self = 2025-05-07T20:33:03.0717787Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:03.0718062Z 2025-05-07T20:33:03.0718155Z @given( 2025-05-07T20:33:03.0718390Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0718729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0719068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0719413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0719773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0720080Z ) 2025-05-07T20:33:03.0720456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0720922Z def test_silu_mul_quant( 2025-05-07T20:33:03.0721159Z self, 2025-05-07T20:33:03.0721439Z T: int, 2025-05-07T20:33:03.0721633Z D: int, 2025-05-07T20:33:03.0721848Z scale_ub: Optional[float], 2025-05-07T20:33:03.0722115Z contiguous: bool, 2025-05-07T20:33:03.0722344Z compiled: bool, 2025-05-07T20:33:03.0722566Z ) -> None: 2025-05-07T20:33:03.0722776Z torch.manual_seed(2025) 2025-05-07T20:33:03.0723007Z 2025-05-07T20:33:03.0723275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0723612Z 2025-05-07T20:33:03.0723793Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0724081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0724488Z x = x_sign * x_clamp 2025-05-07T20:33:03.0724721Z x0 = x[:, :D] 2025-05-07T20:33:03.0724935Z x1 = x[:, D:] 2025-05-07T20:33:03.0725142Z 2025-05-07T20:33:03.0725315Z if contiguous: 2025-05-07T20:33:03.0725548Z x0 = x0.contiguous() 2025-05-07T20:33:03.0725804Z x1 = x1.contiguous() 2025-05-07T20:33:03.0726048Z 2025-05-07T20:33:03.0726232Z if scale_ub is not None: 2025-05-07T20:33:03.0726503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0726837Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0727137Z ) 2025-05-07T20:33:03.0727326Z else: 2025-05-07T20:33:03.0727535Z scale_ub_tensor = None 2025-05-07T20:33:03.0727774Z 2025-05-07T20:33:03.0728007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0728318Z op = silu_mul_quant 2025-05-07T20:33:03.0728565Z if compiled: 2025-05-07T20:33:03.0728807Z op = torch.compile(op) 2025-05-07T20:33:03.0729193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0729458Z 2025-05-07T20:33:03.0729654Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.0729967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.0730285Z 2025-05-07T20:33:03.0730518Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0730863Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.0731164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.0731483Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.0731852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.0732176Z 2025-05-07T20:33:03.0732374Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.0732583Z 2025-05-07T20:33:03.0732684Z moe/activation_test.py:126: 2025-05-07T20:33:03.0732991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0733334Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.0733711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.0734506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.0735274Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.0735855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0736550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0737251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.0737978Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.0738705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.0739362Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.0739980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.0740504Z fn() 2025-05-07T20:33:03.0741053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.0741639Z self.fn.run( 2025-05-07T20:33:03.0742114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0742639Z kernel = self.compile( 2025-05-07T20:33:03.0743189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0743851Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0744245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0744494Z 2025-05-07T20:33:03.0744707Z self = 2025-05-07T20:33:03.0745803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0747199Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d2019ee0>} 2025-05-07T20:33:03.0748541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0749556Z context = 2025-05-07T20:33:03.0749855Z 2025-05-07T20:33:03.0750067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0750602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0751077Z module_map=module_map) 2025-05-07T20:33:03.0751445Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0751819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.0752092Z E ^ 2025-05-07T20:33:03.0752552Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0753013Z 2025-05-07T20:33:03.0753430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.0753953Z 2025-05-07T20:33:03.0754054Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.0754478Z self=, 2025-05-07T20:33:03.0754871Z T=4096, 2025-05-07T20:33:03.0755074Z D=5120, 2025-05-07T20:33:03.0755275Z scale_ub=None, 2025-05-07T20:33:03.0755540Z contiguous=False, 2025-05-07T20:33:03.0755780Z compiled=False, 2025-05-07T20:33:03.0755996Z ) 2025-05-07T20:33:03.0756309Z self = 2025-05-07T20:33:03.0756853Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.0757131Z 2025-05-07T20:33:03.0757208Z @given( 2025-05-07T20:33:03.0757451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.0757758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.0758068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.0758402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.0758734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.0759031Z ) 2025-05-07T20:33:03.0759389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.0759824Z def test_silu_mul_quant( 2025-05-07T20:33:03.0760088Z self, 2025-05-07T20:33:03.0760287Z T: int, 2025-05-07T20:33:03.0760485Z D: int, 2025-05-07T20:33:03.0760710Z scale_ub: Optional[float], 2025-05-07T20:33:03.0761040Z contiguous: bool, 2025-05-07T20:33:03.0761297Z compiled: bool, 2025-05-07T20:33:03.0761518Z ) -> None: 2025-05-07T20:33:03.0761741Z torch.manual_seed(2025) 2025-05-07T20:33:03.0761987Z 2025-05-07T20:33:03.0762258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.0762609Z 2025-05-07T20:33:03.0762812Z x_sign = torch.sign(x) 2025-05-07T20:33:03.0763105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.0763427Z x = x_sign * x_clamp 2025-05-07T20:33:03.0763669Z x0 = x[:, :D] 2025-05-07T20:33:03.0763883Z x1 = x[:, D:] 2025-05-07T20:33:03.0764093Z 2025-05-07T20:33:03.0764432Z if contiguous: 2025-05-07T20:33:03.0764661Z x0 = x0.contiguous() 2025-05-07T20:33:03.0764935Z x1 = x1.contiguous() 2025-05-07T20:33:03.0765180Z 2025-05-07T20:33:03.0765363Z if scale_ub is not None: 2025-05-07T20:33:03.0765642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.0765993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.0766301Z ) 2025-05-07T20:33:03.0766494Z else: 2025-05-07T20:33:03.0766716Z scale_ub_tensor = None 2025-05-07T20:33:03.0766967Z 2025-05-07T20:33:03.0767201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.0767525Z op = silu_mul_quant 2025-05-07T20:33:03.0767785Z if compiled: 2025-05-07T20:33:03.0768037Z op = torch.compile(op) 2025-05-07T20:33:03.0768353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0768637Z 2025-05-07T20:33:03.0768829Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.0769094Z 2025-05-07T20:33:03.0769196Z moe/activation_test.py:117: 2025-05-07T20:33:03.0769514Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0769840Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.0770134Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.0770840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.0771533Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.0772058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.0772750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.0773422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.0773955Z kernel = self.compile( 2025-05-07T20:33:03.0774537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.0775199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.0775606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.0775881Z 2025-05-07T20:33:03.0776086Z self = 2025-05-07T20:33:03.0777167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.0778534Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e42700>} 2025-05-07T20:33:03.0779883Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.0780911Z context = 2025-05-07T20:33:03.0781248Z 2025-05-07T20:33:03.0781411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.0781941Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.0782424Z module_map=module_map) 2025-05-07T20:33:03.0782793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.0783137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.0783399Z E ^ 2025-05-07T20:33:03.0783863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.0784307Z 2025-05-07T20:33:03.0784726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.7943400Z 2025-05-07T20:33:03.7944286Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.7944974Z self=, 2025-05-07T20:33:03.7945579Z T=4096, 2025-05-07T20:33:03.7945795Z D=7168, 2025-05-07T20:33:03.7945992Z scale_ub=None, 2025-05-07T20:33:03.7946204Z contiguous=False, 2025-05-07T20:33:03.7946430Z compiled=False, 2025-05-07T20:33:03.7946645Z ) 2025-05-07T20:33:03.7946972Z self = 2025-05-07T20:33:03.7954208Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:03.7954487Z 2025-05-07T20:33:03.7954568Z @given( 2025-05-07T20:33:03.7954812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.7955155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.7955782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.7956112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.7956441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.7956723Z ) 2025-05-07T20:33:03.7957061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.7957504Z def test_silu_mul_quant( 2025-05-07T20:33:03.7957741Z self, 2025-05-07T20:33:03.7957926Z T: int, 2025-05-07T20:33:03.7958116Z D: int, 2025-05-07T20:33:03.7958330Z scale_ub: Optional[float], 2025-05-07T20:33:03.7958589Z contiguous: bool, 2025-05-07T20:33:03.7958827Z compiled: bool, 2025-05-07T20:33:03.7959052Z ) -> None: 2025-05-07T20:33:03.7959252Z torch.manual_seed(2025) 2025-05-07T20:33:03.7959489Z 2025-05-07T20:33:03.7959759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.7960098Z 2025-05-07T20:33:03.7960280Z x_sign = torch.sign(x) 2025-05-07T20:33:03.7960653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.7960958Z x = x_sign * x_clamp 2025-05-07T20:33:03.7961181Z x0 = x[:, :D] 2025-05-07T20:33:03.7961391Z x1 = x[:, D:] 2025-05-07T20:33:03.7961601Z 2025-05-07T20:33:03.7961849Z if contiguous: 2025-05-07T20:33:03.7962076Z x0 = x0.contiguous() 2025-05-07T20:33:03.7962326Z x1 = x1.contiguous() 2025-05-07T20:33:03.7962553Z 2025-05-07T20:33:03.7962738Z if scale_ub is not None: 2025-05-07T20:33:03.7963005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.7963330Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.7963635Z ) 2025-05-07T20:33:03.7963820Z else: 2025-05-07T20:33:03.7964016Z scale_ub_tensor = None 2025-05-07T20:33:03.7964418Z 2025-05-07T20:33:03.7964648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7964954Z op = silu_mul_quant 2025-05-07T20:33:03.7965191Z if compiled: 2025-05-07T20:33:03.7965435Z op = torch.compile(op) 2025-05-07T20:33:03.7965724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7965979Z 2025-05-07T20:33:03.7966252Z > y_fp8, y_scale = fn() 2025-05-07T20:33:03.7966413Z 2025-05-07T20:33:03.7966515Z moe/activation_test.py:117: 2025-05-07T20:33:03.7966799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7967126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:03.7967402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7968081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:03.7968767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:03.7969303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.7969987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.7970638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.7971165Z kernel = self.compile( 2025-05-07T20:33:03.7971705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.7972360Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.7972749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7972984Z 2025-05-07T20:33:03.7973187Z self = 2025-05-07T20:33:03.7974320Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.7975706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d1e41f80>} 2025-05-07T20:33:03.7977039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.7978061Z context = 2025-05-07T20:33:03.7978355Z 2025-05-07T20:33:03.7978518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.7979027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.7979491Z module_map=module_map) 2025-05-07T20:33:03.7979854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.7980195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:03.7980481Z E ^ 2025-05-07T20:33:03.7980940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.7981392Z 2025-05-07T20:33:03.7981813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:03.7982360Z 2025-05-07T20:33:03.7982466Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:03.7982869Z self=, 2025-05-07T20:33:03.7983268Z T=128, 2025-05-07T20:33:03.7983454Z D=7168, 2025-05-07T20:33:03.7983627Z scale_ub=None, 2025-05-07T20:33:03.7983842Z contiguous=False, 2025-05-07T20:33:03.7984066Z compiled=True, 2025-05-07T20:33:03.7984255Z ) 2025-05-07T20:33:03.7984571Z self = 2025-05-07T20:33:03.7985061Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:03.7985328Z 2025-05-07T20:33:03.7985396Z @given( 2025-05-07T20:33:03.7985622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:03.7985927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:03.7986278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:03.7986591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:03.7986916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:03.7987188Z ) 2025-05-07T20:33:03.7987521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:03.7987949Z def test_silu_mul_quant( 2025-05-07T20:33:03.7988180Z self, 2025-05-07T20:33:03.7988360Z T: int, 2025-05-07T20:33:03.7988544Z D: int, 2025-05-07T20:33:03.7988755Z scale_ub: Optional[float], 2025-05-07T20:33:03.7989011Z contiguous: bool, 2025-05-07T20:33:03.7989244Z compiled: bool, 2025-05-07T20:33:03.7989457Z ) -> None: 2025-05-07T20:33:03.7989656Z torch.manual_seed(2025) 2025-05-07T20:33:03.7989888Z 2025-05-07T20:33:03.7990154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:03.7990485Z 2025-05-07T20:33:03.7990680Z x_sign = torch.sign(x) 2025-05-07T20:33:03.7990965Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:03.7991263Z x = x_sign * x_clamp 2025-05-07T20:33:03.7991485Z x0 = x[:, :D] 2025-05-07T20:33:03.7991690Z x1 = x[:, D:] 2025-05-07T20:33:03.7991882Z 2025-05-07T20:33:03.7992048Z if contiguous: 2025-05-07T20:33:03.7992272Z x0 = x0.contiguous() 2025-05-07T20:33:03.7992520Z x1 = x1.contiguous() 2025-05-07T20:33:03.7992739Z 2025-05-07T20:33:03.7992922Z if scale_ub is not None: 2025-05-07T20:33:03.7993188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:03.7993561Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:03.7993866Z ) 2025-05-07T20:33:03.7994045Z else: 2025-05-07T20:33:03.7994238Z scale_ub_tensor = None 2025-05-07T20:33:03.7994473Z 2025-05-07T20:33:03.7994693Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7994999Z op = silu_mul_quant 2025-05-07T20:33:03.7995242Z if compiled: 2025-05-07T20:33:03.7995478Z op = torch.compile(op) 2025-05-07T20:33:03.7995768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:03.7996022Z 2025-05-07T20:33:03.7996205Z y_fp8, y_scale = fn() 2025-05-07T20:33:03.7996483Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:03.7996753Z 2025-05-07T20:33:03.7996983Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:03.7997311Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:03.7997590Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:03.7997894Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:03.7998292Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.7998590Z 2025-05-07T20:33:03.7998786Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:03.7998984Z 2025-05-07T20:33:03.7999117Z moe/activation_test.py:126: 2025-05-07T20:33:03.7999407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.7999726Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:03.8000039Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:03.8000815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:03.8001547Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:03.8002085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:03.8002764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:03.8003437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:03.8004187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:03.8005049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:03.8005678Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:03.8006269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:03.8006793Z fn() 2025-05-07T20:33:03.8007317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:03.8007886Z self.fn.run( 2025-05-07T20:33:03.8008668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:03.8009197Z kernel = self.compile( 2025-05-07T20:33:03.8009727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:03.8010379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:03.8010768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:03.8011001Z 2025-05-07T20:33:03.8011202Z self = 2025-05-07T20:33:03.8012278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:03.8013753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d17f7c40>} 2025-05-07T20:33:03.8015081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:03.8016101Z context = 2025-05-07T20:33:03.8016391Z 2025-05-07T20:33:03.8016553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:03.8017067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:03.8017520Z module_map=module_map) 2025-05-07T20:33:03.8017882Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:03.8018231Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:03.8018483Z E ^ 2025-05-07T20:33:03.8019103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:03.8019633Z 2025-05-07T20:33:03.8020048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0473496Z 2025-05-07T20:33:04.0474331Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0475683Z self=, 2025-05-07T20:33:04.0476879Z T=128, 2025-05-07T20:33:04.0477419Z D=7168, 2025-05-07T20:33:04.0477937Z scale_ub=None, 2025-05-07T20:33:04.0478341Z contiguous=False, 2025-05-07T20:33:04.0478773Z compiled=False, 2025-05-07T20:33:04.0479371Z ) 2025-05-07T20:33:04.0480042Z self = 2025-05-07T20:33:04.0481008Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.0481556Z 2025-05-07T20:33:04.0481702Z @given( 2025-05-07T20:33:04.0482165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0482791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0483387Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0484029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0485097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0485643Z ) 2025-05-07T20:33:04.0486320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0486908Z def test_silu_mul_quant( 2025-05-07T20:33:04.0487224Z self, 2025-05-07T20:33:04.0487414Z T: int, 2025-05-07T20:33:04.0487615Z D: int, 2025-05-07T20:33:04.0487838Z scale_ub: Optional[float], 2025-05-07T20:33:04.0488100Z contiguous: bool, 2025-05-07T20:33:04.0488346Z compiled: bool, 2025-05-07T20:33:04.0488582Z ) -> None: 2025-05-07T20:33:04.0488793Z torch.manual_seed(2025) 2025-05-07T20:33:04.0489043Z 2025-05-07T20:33:04.0489322Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0489654Z 2025-05-07T20:33:04.0489854Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0490151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0490461Z x = x_sign * x_clamp 2025-05-07T20:33:04.0490708Z x0 = x[:, :D] 2025-05-07T20:33:04.0490932Z x1 = x[:, D:] 2025-05-07T20:33:04.0491133Z 2025-05-07T20:33:04.0491303Z if contiguous: 2025-05-07T20:33:04.0491532Z x0 = x0.contiguous() 2025-05-07T20:33:04.0491788Z x1 = x1.contiguous() 2025-05-07T20:33:04.0492013Z 2025-05-07T20:33:04.0492200Z if scale_ub is not None: 2025-05-07T20:33:04.0492470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0492795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0493106Z ) 2025-05-07T20:33:04.0493298Z else: 2025-05-07T20:33:04.0493604Z scale_ub_tensor = None 2025-05-07T20:33:04.0493857Z 2025-05-07T20:33:04.0494092Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0494391Z op = silu_mul_quant 2025-05-07T20:33:04.0494641Z if compiled: 2025-05-07T20:33:04.0494893Z op = torch.compile(op) 2025-05-07T20:33:04.0495180Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0495453Z 2025-05-07T20:33:04.0495645Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0495808Z 2025-05-07T20:33:04.0495914Z moe/activation_test.py:117: 2025-05-07T20:33:04.0496199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0496531Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0496810Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0497495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0498184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0498814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0499500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0500196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0500720Z kernel = self.compile( 2025-05-07T20:33:04.0501257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0501902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0502296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0502529Z 2025-05-07T20:33:04.0502731Z self = 2025-05-07T20:33:04.0503808Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0505175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d12b1d00>} 2025-05-07T20:33:04.0506553Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0507566Z context = 2025-05-07T20:33:04.0507848Z 2025-05-07T20:33:04.0508021Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0508799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0509255Z module_map=module_map) 2025-05-07T20:33:04.0509620Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0509966Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0510217Z E ^ 2025-05-07T20:33:04.0510677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0511120Z 2025-05-07T20:33:04.0511538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0512042Z 2025-05-07T20:33:04.0512153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0512552Z self=, 2025-05-07T20:33:04.0512945Z T=4096, 2025-05-07T20:33:04.0513132Z D=5120, 2025-05-07T20:33:04.0513311Z scale_ub=1200.0, 2025-05-07T20:33:04.0513528Z contiguous=True, 2025-05-07T20:33:04.0513823Z compiled=False, 2025-05-07T20:33:04.0514015Z ) 2025-05-07T20:33:04.0514337Z self = 2025-05-07T20:33:04.0514829Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.0515099Z 2025-05-07T20:33:04.0515185Z @given( 2025-05-07T20:33:04.0515403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0515716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0516018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0516336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0516662Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0516942Z ) 2025-05-07T20:33:04.0517278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0517711Z def test_silu_mul_quant( 2025-05-07T20:33:04.0517950Z self, 2025-05-07T20:33:04.0518138Z T: int, 2025-05-07T20:33:04.0518332Z D: int, 2025-05-07T20:33:04.0518621Z scale_ub: Optional[float], 2025-05-07T20:33:04.0518881Z contiguous: bool, 2025-05-07T20:33:04.0519120Z compiled: bool, 2025-05-07T20:33:04.0519340Z ) -> None: 2025-05-07T20:33:04.0519557Z torch.manual_seed(2025) 2025-05-07T20:33:04.0519849Z 2025-05-07T20:33:04.0520118Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0520456Z 2025-05-07T20:33:04.0520638Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0520928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.0521239Z x = x_sign * x_clamp 2025-05-07T20:33:04.0521467Z x0 = x[:, :D] 2025-05-07T20:33:04.0521680Z x1 = x[:, D:] 2025-05-07T20:33:04.0521886Z 2025-05-07T20:33:04.0522058Z if contiguous: 2025-05-07T20:33:04.0522286Z x0 = x0.contiguous() 2025-05-07T20:33:04.0522542Z x1 = x1.contiguous() 2025-05-07T20:33:04.0522776Z 2025-05-07T20:33:04.0522965Z if scale_ub is not None: 2025-05-07T20:33:04.0523240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0523564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0523941Z ) 2025-05-07T20:33:04.0524138Z else: 2025-05-07T20:33:04.0524457Z scale_ub_tensor = None 2025-05-07T20:33:04.0524694Z 2025-05-07T20:33:04.0524920Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0525229Z op = silu_mul_quant 2025-05-07T20:33:04.0525468Z if compiled: 2025-05-07T20:33:04.0525714Z op = torch.compile(op) 2025-05-07T20:33:04.0526007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0526268Z 2025-05-07T20:33:04.0526461Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.0526620Z 2025-05-07T20:33:04.0526726Z moe/activation_test.py:117: 2025-05-07T20:33:04.0527011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0527343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.0527622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0528304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.0528982Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.0529515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0530190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0530839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0531364Z kernel = self.compile( 2025-05-07T20:33:04.0531903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0532602Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0532998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0533230Z 2025-05-07T20:33:04.0533435Z self = 2025-05-07T20:33:04.0534512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0535865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d12b2160>} 2025-05-07T20:33:04.0537183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0538243Z context = 2025-05-07T20:33:04.0538536Z 2025-05-07T20:33:04.0538700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0539216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0539711Z module_map=module_map) 2025-05-07T20:33:04.0540071Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0540422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.0540677Z E ^ 2025-05-07T20:33:04.0541129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:04.0541991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:04.0542604Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -- fails in ref_fn() (moe/activation_test.py:126 -> triton_quantize_fp8_row, fp8_gemm.py:2370 -> _kernel_quantize_fp8_row) with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); the test source listing and traceback are otherwise identical to the example above
2025-05-07T20:33:04.7784978Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
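Every failure in this run is the same compile-time capability check inside Triton: the kernels cast to fp8e4nv (the Triton name for torch.float8_e4m3fn), which Triton only implements for NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); older parts such as an A10G (8.6) only offer fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal hardware guard along these lines -- a sketch using only public torch/unittest APIs, where the helper and decorator names are illustrative and not part of activation_test.py -- would skip these examples instead of erroring:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers torch.float8_e4m3fn to fp8e4nv, which is only
        # implemented for compute capability >= 8.9 (e.g. L4, L40S, H100);
        # an A10G reports (8, 6) and raises the ValueError seen above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical skip decorator for tests such as test_silu_mul_quant:
    skip_unless_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )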
2025-05-07T20:33:04.7824640Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
2025-05-07T20:33:05.5965065Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
2025-05-07T20:33:05.6003901Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
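For reference, the operation under test is a SiLU-gated product followed by rowwise fp8 quantization: each row of y = silu(x0) * x1 is scaled so that its maximum magnitude fits the fp8 range, and the per-row dequantization scale is returned with the fp8 tensor -- which is why the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of that reference math (an illustration, not the FBGEMM Triton kernel; in particular, treating scale_ub as a clamp on the row maximum is an assumption):

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, matching ref_fn in the test above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:          # scale_ub: shape-[1] fp32 tensor
            row_max = torch.minimum(row_max, scale_ub)
        row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # y ~= y_fp8.float() * scale[:, None]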
2025-05-07T20:33:05.6249374Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:05.6250902Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:05.6252227Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:05.6253227Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:06.0808954Z W0507 20:33:05.623000 89080 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:06.0810007Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -- fails in fn() (moe/activation_test.py:117, through torch/_dynamo/eval_frame.py:678 -> silu_mul_quant, activation.py:80 -> _fbgemm_silu_mul_quant) with the identical CompilationError
2025-05-07T20:33:06.0842469Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same failure in ref_fn() -> _kernel_quantize_fp8_row
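The recompile_limit warning above is a separate issue from the fp8 failures: every new combination of T and contiguity changes the guarded strides of x0, so torch.compile re-traces silu_mul_quant until it hits the limit of 8 and silently falls back to eager for later examples. Two conventional mitigations for property-based tests, sketched under the assumption that silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: clear dynamo caches between Hypothesis examples so each
    # parameter combination starts with a fresh recompile budget.
    torch._dynamo.reset()

    # Option 2: compile once with symbolic shapes instead of one graph
    # per (T, stride) combination.
    op = torch.compile(silu_mul_quant, dynamic=True)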
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.0879980Z 2025-05-07T20:33:06.0880386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.2299039Z 2025-05-07T20:33:06.2299438Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.2300064Z self=, 2025-05-07T20:33:06.2300588Z T=1, 2025-05-07T20:33:06.2300897Z D=5120, 2025-05-07T20:33:06.2301275Z scale_ub=None, 2025-05-07T20:33:06.2301679Z contiguous=True, 2025-05-07T20:33:06.2302094Z compiled=False, 2025-05-07T20:33:06.2302467Z ) 2025-05-07T20:33:06.2303090Z self = 2025-05-07T20:33:06.2304411Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:06.2304919Z 2025-05-07T20:33:06.2305058Z @given( 2025-05-07T20:33:06.2305499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.2306218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.2306796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.2307416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.2308040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.2309206Z ) 2025-05-07T20:33:06.2309859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.2310701Z def test_silu_mul_quant( 2025-05-07T20:33:06.2311040Z self, 2025-05-07T20:33:06.2311220Z T: int, 2025-05-07T20:33:06.2311409Z D: int, 2025-05-07T20:33:06.2311624Z scale_ub: Optional[float], 2025-05-07T20:33:06.2311877Z contiguous: bool, 2025-05-07T20:33:06.2312110Z compiled: bool, 2025-05-07T20:33:06.2312323Z ) -> None: 2025-05-07T20:33:06.2312521Z torch.manual_seed(2025) 2025-05-07T20:33:06.2312849Z 2025-05-07T20:33:06.2313110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.2313433Z 2025-05-07T20:33:06.2313613Z x_sign = torch.sign(x) 2025-05-07T20:33:06.2313892Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.2314187Z x = x_sign * x_clamp 2025-05-07T20:33:06.2314408Z x0 = x[:, :D] 2025-05-07T20:33:06.2314611Z x1 = x[:, D:] 2025-05-07T20:33:06.2314805Z 2025-05-07T20:33:06.2314971Z if contiguous: 2025-05-07T20:33:06.2315190Z x0 = x0.contiguous() 2025-05-07T20:33:06.2315436Z x1 = x1.contiguous() 2025-05-07T20:33:06.2315655Z 2025-05-07T20:33:06.2315834Z if scale_ub is not None: 2025-05-07T20:33:06.2316096Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.2316416Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.2316710Z ) 2025-05-07T20:33:06.2316888Z else: 2025-05-07T20:33:06.2317083Z scale_ub_tensor = None 2025-05-07T20:33:06.2317323Z 2025-05-07T20:33:06.2317542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.2317841Z op = silu_mul_quant 2025-05-07T20:33:06.2318082Z if compiled: 2025-05-07T20:33:06.2318317Z op = torch.compile(op) 2025-05-07T20:33:06.2318602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2318855Z 2025-05-07T20:33:06.2319034Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.2319193Z 2025-05-07T20:33:06.2319297Z moe/activation_test.py:117: 2025-05-07T20:33:06.2319578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2319999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.2320273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2320946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.2321677Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.2322206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.2322881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.2323527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.2324046Z kernel = self.compile( 2025-05-07T20:33:06.2324750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.2325396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.2325854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2326084Z 2025-05-07T20:33:06.2326285Z self = 2025-05-07T20:33:06.2327354Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.2328784Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d0501b20>} 2025-05-07T20:33:06.2330108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.2331144Z context = 2025-05-07T20:33:06.2331467Z 2025-05-07T20:33:06.2331630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.2332147Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.2332642Z module_map=module_map) 2025-05-07T20:33:06.2333003Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.2333349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.2333590Z E ^ 2025-05-07T20:33:06.2334043Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.2334491Z 2025-05-07T20:33:06.2334899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.2335405Z 2025-05-07T20:33:06.2335510Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.2335908Z self=, 2025-05-07T20:33:06.2336300Z T=128, 2025-05-07T20:33:06.2336480Z D=5120, 2025-05-07T20:33:06.2336656Z scale_ub=None, 2025-05-07T20:33:06.2336861Z contiguous=False, 2025-05-07T20:33:06.2337081Z compiled=True, 2025-05-07T20:33:06.2337270Z ) 2025-05-07T20:33:06.2337575Z self = 2025-05-07T20:33:06.2338056Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:06.2338317Z 2025-05-07T20:33:06.2338391Z @given( 2025-05-07T20:33:06.2338604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.2338907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.2339206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.2339518Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.2339837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.2340165Z ) 2025-05-07T20:33:06.2340502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.2340931Z def test_silu_mul_quant( 2025-05-07T20:33:06.2341160Z self, 2025-05-07T20:33:06.2341345Z T: int, 2025-05-07T20:33:06.2348245Z D: int, 2025-05-07T20:33:06.2348514Z scale_ub: Optional[float], 2025-05-07T20:33:06.2348800Z contiguous: bool, 2025-05-07T20:33:06.2349053Z compiled: bool, 2025-05-07T20:33:06.2349277Z ) -> None: 2025-05-07T20:33:06.2349501Z torch.manual_seed(2025) 2025-05-07T20:33:06.2349753Z 2025-05-07T20:33:06.2350027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.2350381Z 2025-05-07T20:33:06.2350580Z x_sign = torch.sign(x) 2025-05-07T20:33:06.2350874Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.2351193Z x = x_sign * x_clamp 2025-05-07T20:33:06.2351489Z x0 = x[:, :D] 2025-05-07T20:33:06.2351713Z x1 = x[:, D:] 2025-05-07T20:33:06.2351920Z 2025-05-07T20:33:06.2352225Z if contiguous: 2025-05-07T20:33:06.2352457Z x0 = x0.contiguous() 2025-05-07T20:33:06.2352710Z x1 = x1.contiguous() 2025-05-07T20:33:06.2352957Z 2025-05-07T20:33:06.2353152Z if scale_ub is not None: 2025-05-07T20:33:06.2353464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.2353800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.2354111Z ) 2025-05-07T20:33:06.2354298Z else: 2025-05-07T20:33:06.2354512Z scale_ub_tensor = None 2025-05-07T20:33:06.2354764Z 2025-05-07T20:33:06.2354985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.2355306Z op = silu_mul_quant 2025-05-07T20:33:06.2355557Z if compiled: 2025-05-07T20:33:06.2355800Z op = torch.compile(op) 2025-05-07T20:33:06.2356097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2356371Z 2025-05-07T20:33:06.2356569Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.2356731Z 2025-05-07T20:33:06.2356835Z moe/activation_test.py:117: 2025-05-07T20:33:06.2357130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.2357515Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.2357789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.2358348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.2358896Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.2359541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:06.2360218Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:06.2360749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.2361481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.2362131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.2362654Z kernel = self.compile(
2025-05-07T20:33:06.2363195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.2363842Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.2364233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.2364781Z self = <...>
2025-05-07T20:33:06.2365903Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.2367271Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f16d0501b20>}
2025-05-07T20:33:06.2368593Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.2369611Z context = <...>
2025-05-07T20:33:06.2370073Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.2370590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.2371048Z module_map=module_map)
2025-05-07T20:33:06.2371416Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.2371769Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:06.2372019Z E ^
2025-05-07T20:33:06.2372533Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.2373400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
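Every Hypothesis example fails the same way: Triton's make_ir rejects the fp8e4nv (float8_e4m3fn) dtype before any kernel code runs. This is an architecture gap rather than a test bug: the job runs on a linux.g5.4xlarge runner, whose A10G GPU reports compute capability 8.6, while Triton only emits fp8e4nv on SM 8.9+ (Ada/Hopper); on SM 8.6 only fp8e4b15 and fp8e5 are available, exactly as the ValueError states. A minimal guard sketch for skipping these tests on unsupported GPUs follows; the helper and marker names are ours, not FBGEMM's:

```python
# Sketch (hypothetical helper, not part of FBGEMM): skip FP8 e4m3 tests on
# GPUs older than SM 8.9, where Triton cannot compile the fp8e4nv dtype.
import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) codegen needs compute
    # capability >= (8, 9); the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="Triton fp8e4nv needs SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
)
```

Applied as `@requires_fp8e4nv` on `test_silu_mul_quant`, a guard like this would turn the repeats below into skips instead of letting Hypothesis iterate over an error that no parameter combination can avoid.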
[repeated test source and identical CompilationError tracebacks elided; one line per Hypothesis example]
2025-05-07T20:33:06.2374056Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.3978514Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.4009791Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.5603433Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:33:06.5643481Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.5675019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.7748967Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[same test source as above; this is the only example in this stretch of the log that gets past fn() and fails in the reference path instead]
2025-05-07T20:33:06.7768220Z y_fp8, y_scale = fn()
2025-05-07T20:33:06.7768504Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:06.7769031Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7769363Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:06.7769660Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:06.7770064Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:06.7770452Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7770954Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.7771331Z moe/activation_test.py:126:
2025-05-07T20:33:06.7771624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.7771953Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:06.7772277Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7773065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:06.7773809Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:06.7774355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.7775041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.7775737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:06.7776504Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:06.7777237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:06.7777878Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:06.7778478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:06.7778986Z fn()
2025-05-07T20:33:06.7779493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:06.7780071Z self.fn.run(
2025-05-07T20:33:06.7780531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.7781057Z kernel = self.compile(
2025-05-07T20:33:06.7781594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.7782248Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.7782633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:06.7783084Z self = <...>
2025-05-07T20:33:06.7784169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.7785607Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f15afbd4180>}
2025-05-07T20:33:06.7786952Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:06.7787982Z context = <...>
2025-05-07T20:33:06.7788439Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.7788951Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.7789421Z module_map=module_map)
2025-05-07T20:33:06.7789788Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.7790138Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.7790401Z E ^
2025-05-07T20:33:06.7790958Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.7791831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
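So even the reference path depends on an fp8e4nv Triton kernel: ref_fn computes SiLU(x0) * x1 in fp32 and then calls triton_quantize_fp8_row, which hits the same architecture check inside _kernel_quantize_fp8_row. For orientation, here is a rough pure-PyTorch sketch of a rowwise FP8 quantization of that shape; the function name and the exact scale/clamp details are our assumptions for illustration, not FBGEMM's actual triton_quantize_fp8_row:

```python
# Rough sketch of rowwise FP8 quantization (assumed semantics: per-row
# scale = max|row| / fp8_max, optionally clamped by scale_ub). Illustrative
# only; FBGEMM's Triton kernel may differ in details.
import torch

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # Per-row maximum magnitude, kept in fp32 for the division below.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```

Dequantizing with `y_fp8.to(torch.float32) * scale[:, None]` mirrors the check the test performs on the real kernel's output above.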
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.7791409Z 2025-05-07T20:33:06.7791831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.7792385Z 2025-05-07T20:33:06.7792494Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.7792902Z self=, 2025-05-07T20:33:06.7793302Z T=1, 2025-05-07T20:33:06.7793489Z D=5120, 2025-05-07T20:33:06.7793678Z scale_ub=1200.0, 2025-05-07T20:33:06.7793903Z contiguous=False, 2025-05-07T20:33:06.7794126Z compiled=True, 2025-05-07T20:33:06.7794321Z ) 2025-05-07T20:33:06.7794640Z self = 2025-05-07T20:33:06.7795126Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.7795387Z 2025-05-07T20:33:06.7795466Z @given( 2025-05-07T20:33:06.7795695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.7796001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.7796293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.7796668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.7796997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.7797282Z ) 2025-05-07T20:33:06.7797625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.7798070Z def test_silu_mul_quant( 2025-05-07T20:33:06.7798307Z self, 2025-05-07T20:33:06.7798495Z T: int, 2025-05-07T20:33:06.7798693Z D: int, 2025-05-07T20:33:06.7798910Z scale_ub: Optional[float], 2025-05-07T20:33:06.7799168Z contiguous: bool, 2025-05-07T20:33:06.7799405Z compiled: bool, 2025-05-07T20:33:06.7799630Z ) -> None: 2025-05-07T20:33:06.7799841Z torch.manual_seed(2025) 2025-05-07T20:33:06.7800083Z 2025-05-07T20:33:06.7800359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.7800692Z 2025-05-07T20:33:06.7800889Z x_sign = torch.sign(x) 2025-05-07T20:33:06.7801182Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.7801492Z x = x_sign * x_clamp 2025-05-07T20:33:06.7801721Z x0 = x[:, :D] 2025-05-07T20:33:06.7801936Z x1 = x[:, D:] 2025-05-07T20:33:06.7802154Z 2025-05-07T20:33:06.7802330Z if contiguous: 2025-05-07T20:33:06.7802563Z x0 = x0.contiguous() 2025-05-07T20:33:06.7802828Z x1 = x1.contiguous() 2025-05-07T20:33:06.7803062Z 2025-05-07T20:33:06.7803254Z if scale_ub is not None: 2025-05-07T20:33:06.7803527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.7803854Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.7804162Z ) 2025-05-07T20:33:06.7804514Z else: 2025-05-07T20:33:06.7804711Z scale_ub_tensor = None 2025-05-07T20:33:06.7804957Z 2025-05-07T20:33:06.7805177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.7805475Z op = silu_mul_quant 2025-05-07T20:33:06.7805722Z if compiled: 2025-05-07T20:33:06.7805965Z op = torch.compile(op) 2025-05-07T20:33:06.7806249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.7806504Z 2025-05-07T20:33:06.7806686Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.7806847Z 2025-05-07T20:33:06.7806944Z moe/activation_test.py:117: 2025-05-07T20:33:06.7807225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.7807546Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.7807817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.7808637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.7809187Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.7809911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.7810594Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.7811172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.7811848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.7812505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.7813020Z kernel = self.compile( 2025-05-07T20:33:06.7813553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.7814199Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.7814592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.7814819Z 2025-05-07T20:33:06.7815020Z self = 2025-05-07T20:33:06.7816095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.7817530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd5300>} 2025-05-07T20:33:06.7818861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.7819878Z context = 2025-05-07T20:33:06.7820164Z 2025-05-07T20:33:06.7820327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.7820840Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.7821303Z module_map=module_map) 2025-05-07T20:33:06.7821656Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.7821998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.7822244Z E ^ 2025-05-07T20:33:06.7822699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.7823144Z 2025-05-07T20:33:06.7823559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9238220Z 2025-05-07T20:33:06.9238796Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9239952Z self=, 2025-05-07T20:33:06.9240390Z T=1, 2025-05-07T20:33:06.9240585Z D=5120, 2025-05-07T20:33:06.9240775Z scale_ub=1200.0, 2025-05-07T20:33:06.9241000Z contiguous=False, 2025-05-07T20:33:06.9241229Z compiled=False, 2025-05-07T20:33:06.9241441Z ) 2025-05-07T20:33:06.9241764Z self = 2025-05-07T20:33:06.9242253Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:06.9242519Z 2025-05-07T20:33:06.9242608Z @given( 2025-05-07T20:33:06.9242840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9243180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9243509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9243856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9244215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9244659Z ) 2025-05-07T20:33:06.9245038Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9245605Z def test_silu_mul_quant( 2025-05-07T20:33:06.9245864Z self, 2025-05-07T20:33:06.9246068Z T: int, 2025-05-07T20:33:06.9246271Z D: int, 2025-05-07T20:33:06.9246507Z scale_ub: Optional[float], 2025-05-07T20:33:06.9246870Z contiguous: bool, 2025-05-07T20:33:06.9247120Z compiled: bool, 2025-05-07T20:33:06.9247368Z ) -> None: 2025-05-07T20:33:06.9247593Z torch.manual_seed(2025) 2025-05-07T20:33:06.9247845Z 2025-05-07T20:33:06.9248137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9248506Z 2025-05-07T20:33:06.9248703Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9249011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9249345Z x = x_sign * x_clamp 2025-05-07T20:33:06.9249599Z x0 = x[:, :D] 2025-05-07T20:33:06.9249840Z x1 = x[:, D:] 2025-05-07T20:33:06.9250062Z 2025-05-07T20:33:06.9250350Z if contiguous: 2025-05-07T20:33:06.9250859Z x0 = x0.contiguous() 2025-05-07T20:33:06.9251418Z x1 = x1.contiguous() 2025-05-07T20:33:06.9251679Z 2025-05-07T20:33:06.9252524Z if scale_ub is not None: 2025-05-07T20:33:06.9252797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9253128Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9253428Z ) 2025-05-07T20:33:06.9253617Z else: 2025-05-07T20:33:06.9253828Z scale_ub_tensor = None 2025-05-07T20:33:06.9254072Z 2025-05-07T20:33:06.9254309Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9254632Z op = silu_mul_quant 2025-05-07T20:33:06.9254877Z if compiled: 2025-05-07T20:33:06.9255138Z op = torch.compile(op) 2025-05-07T20:33:06.9255435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9255700Z 2025-05-07T20:33:06.9255898Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.9256066Z 2025-05-07T20:33:06.9256176Z moe/activation_test.py:117: 2025-05-07T20:33:06.9256472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9256811Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.9257097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9257780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.9258460Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.9258997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9259675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9260378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9260897Z kernel = self.compile( 2025-05-07T20:33:06.9261448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9262110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9262511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9262748Z 2025-05-07T20:33:06.9262953Z self = 2025-05-07T20:33:06.9264037Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9265434Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd6020>} 2025-05-07T20:33:06.9266824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9267840Z context = 2025-05-07T20:33:06.9268174Z 2025-05-07T20:33:06.9268336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9268852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9269311Z module_map=module_map) 2025-05-07T20:33:06.9269659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9270007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.9270255Z E ^ 2025-05-07T20:33:06.9270705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9271159Z 2025-05-07T20:33:06.9271574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9272084Z 2025-05-07T20:33:06.9272184Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9272634Z self=, 2025-05-07T20:33:06.9273020Z T=16384, 2025-05-07T20:33:06.9273202Z D=5120, 2025-05-07T20:33:06.9273387Z scale_ub=1200.0, 2025-05-07T20:33:06.9273595Z contiguous=False, 2025-05-07T20:33:06.9273810Z compiled=True, 2025-05-07T20:33:06.9274008Z ) 2025-05-07T20:33:06.9274362Z self = 2025-05-07T20:33:06.9274856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:06.9275129Z 2025-05-07T20:33:06.9275208Z @given( 2025-05-07T20:33:06.9275429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9282726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9283061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9283405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9283742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9284032Z ) 2025-05-07T20:33:06.9284511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9284960Z def test_silu_mul_quant( 2025-05-07T20:33:06.9285215Z self, 2025-05-07T20:33:06.9285411Z T: int, 2025-05-07T20:33:06.9285619Z D: int, 2025-05-07T20:33:06.9285842Z scale_ub: Optional[float], 2025-05-07T20:33:06.9286107Z contiguous: bool, 2025-05-07T20:33:06.9286354Z compiled: bool, 2025-05-07T20:33:06.9286580Z ) -> None: 2025-05-07T20:33:06.9286792Z torch.manual_seed(2025) 2025-05-07T20:33:06.9287038Z 2025-05-07T20:33:06.9287440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9287779Z 2025-05-07T20:33:06.9287971Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9288270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9288571Z x = x_sign * x_clamp 2025-05-07T20:33:06.9288813Z x0 = x[:, :D] 2025-05-07T20:33:06.9289037Z x1 = x[:, D:] 2025-05-07T20:33:06.9289239Z 2025-05-07T20:33:06.9289425Z if contiguous: 2025-05-07T20:33:06.9289665Z x0 = x0.contiguous() 2025-05-07T20:33:06.9289927Z x1 = x1.contiguous() 2025-05-07T20:33:06.9290161Z 2025-05-07T20:33:06.9290357Z if scale_ub is not None: 2025-05-07T20:33:06.9290636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9290967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9291278Z ) 2025-05-07T20:33:06.9291477Z else: 2025-05-07T20:33:06.9291679Z scale_ub_tensor = None 2025-05-07T20:33:06.9291939Z 2025-05-07T20:33:06.9292172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9292527Z op = silu_mul_quant 2025-05-07T20:33:06.9292780Z if compiled: 2025-05-07T20:33:06.9293041Z op = torch.compile(op) 2025-05-07T20:33:06.9293333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9293657Z 2025-05-07T20:33:06.9293849Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.9294010Z 2025-05-07T20:33:06.9294115Z moe/activation_test.py:117: 2025-05-07T20:33:06.9294407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9294739Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.9295020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9295572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:06.9296132Z return fn(*args, **kwargs) 
2025-05-07T20:33:06.9296798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.9297482Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.9298026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9298759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9299424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9299949Z kernel = self.compile( 2025-05-07T20:33:06.9300496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9301154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9301584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9301837Z 2025-05-07T20:33:06.9302047Z self = 2025-05-07T20:33:06.9303130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9304506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd7600>} 2025-05-07T20:33:06.9305848Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9306866Z context = 2025-05-07T20:33:06.9307164Z 2025-05-07T20:33:06.9307376Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9307902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9308681Z module_map=module_map) 2025-05-07T20:33:06.9309039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9309394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.9309646Z E ^ 2025-05-07T20:33:06.9310112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9310564Z 2025-05-07T20:33:06.9310976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9311483Z 2025-05-07T20:33:06.9311597Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9312003Z self=, 2025-05-07T20:33:06.9312408Z T=2048, 2025-05-07T20:33:06.9312606Z D=7168, 2025-05-07T20:33:06.9312818Z scale_ub=1200.0, 2025-05-07T20:33:06.9313120Z contiguous=False, 2025-05-07T20:33:06.9313482Z compiled=True, 2025-05-07T20:33:07.1177431Z ) 2025-05-07T20:33:07.1178133Z self = 2025-05-07T20:33:07.1179038Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:07.1179690Z 2025-05-07T20:33:07.1179768Z @given( 2025-05-07T20:33:07.1180000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.1180311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.1180607Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.1180939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.1181267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.1181542Z ) 2025-05-07T20:33:07.1181902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.1182375Z def test_silu_mul_quant( 2025-05-07T20:33:07.1182615Z self, 2025-05-07T20:33:07.1182829Z T: int, 2025-05-07T20:33:07.1183035Z D: int, 2025-05-07T20:33:07.1183253Z scale_ub: Optional[float], 2025-05-07T20:33:07.1183530Z contiguous: bool, 2025-05-07T20:33:07.1183897Z compiled: bool, 2025-05-07T20:33:07.1184140Z ) -> None: 2025-05-07T20:33:07.1184352Z torch.manual_seed(2025) 2025-05-07T20:33:07.1184602Z 2025-05-07T20:33:07.1184891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.1185229Z 2025-05-07T20:33:07.1185431Z x_sign = torch.sign(x) 2025-05-07T20:33:07.1185734Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.1186036Z x = x_sign * x_clamp 2025-05-07T20:33:07.1186288Z x0 = x[:, :D] 2025-05-07T20:33:07.1186523Z x1 = x[:, D:] 2025-05-07T20:33:07.1186724Z 2025-05-07T20:33:07.1186918Z if contiguous: 2025-05-07T20:33:07.1187171Z x0 = x0.contiguous() 2025-05-07T20:33:07.1187428Z x1 = x1.contiguous() 2025-05-07T20:33:07.1187678Z 2025-05-07T20:33:07.1187887Z if scale_ub is not None: 2025-05-07T20:33:07.1188156Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.1188508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.1188831Z ) 2025-05-07T20:33:07.1189034Z else: 2025-05-07T20:33:07.1189250Z scale_ub_tensor = None 2025-05-07T20:33:07.1189510Z 2025-05-07T20:33:07.1189745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.1190053Z op = silu_mul_quant 2025-05-07T20:33:07.1190316Z if compiled: 2025-05-07T20:33:07.1190574Z op = torch.compile(op) 2025-05-07T20:33:07.1190864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1191143Z 2025-05-07T20:33:07.1191341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:07.1191508Z 2025-05-07T20:33:07.1191701Z moe/activation_test.py:117: 2025-05-07T20:33:07.1192008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.1192343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:07.1192615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.1193188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:07.1193756Z return fn(*args, **kwargs) 
2025-05-07T20:33:07.1194423Z [... remainder of the traceback identical to the one above: the launch of _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 fails in triton/compiler/compiler.py:100 with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]

Hypothesis then retried test_silu_mul_quant with further sampled parameter combinations; every retry printed the same test body and raised the identical CompilationError from the same kernel launch. The parameter sets are listed after the sketch below.
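For orientation, the op under test, silu_mul_quant, appears from the test body to fuse SiLU(x0) * x1 with rowwise FP8 quantization, returning the quantized tensor and its per-row scales. Below is a minimal eager-mode sketch of that contract; it is an assumption drawn from the test, not FBGEMM's actual implementation, and silu_mul_quant_ref is an illustrative name:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1, computed in fp32 since the test inputs are bf16
        # (assumed semantics, inferred from the test body above)
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absolute max, optionally capped by the scale_ub tensor
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Map each row onto the e4m3 representable range (finfo max is 448.0)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)

The e4m3 target dtype is exactly the fp8e4nv type that Triton refuses to compile for in the errors above.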
Retried parameter sets, each ending in the identical CompilationError:

  T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
  T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
  T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
2025-05-07T20:33:07.7829013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:07.7829703Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:07.7830237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7830911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7831559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7832083Z kernel = self.compile( 2025-05-07T20:33:07.7832619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7833310Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7833706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7833938Z 2025-05-07T20:33:07.7834141Z self = 2025-05-07T20:33:07.7835217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7836577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc782c0>} 2025-05-07T20:33:07.7837911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7838968Z context = 2025-05-07T20:33:07.7839264Z 2025-05-07T20:33:07.7839426Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7839948Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7840446Z module_map=module_map) 2025-05-07T20:33:07.7840809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7841159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:07.7841415Z E ^ 2025-05-07T20:33:07.7841874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.7842325Z 2025-05-07T20:33:07.7842736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.7843243Z 2025-05-07T20:33:07.7843355Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.7843760Z self=, 2025-05-07T20:33:07.7844156Z T=4096, 2025-05-07T20:33:07.7844436Z D=7168, 2025-05-07T20:33:07.7844621Z scale_ub=None, 2025-05-07T20:33:07.7844884Z contiguous=False, 2025-05-07T20:33:07.7845113Z compiled=True, 2025-05-07T20:33:08.2439242Z ) 2025-05-07T20:33:08.2439931Z self = 2025-05-07T20:33:08.2440946Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.2441341Z 2025-05-07T20:33:08.2441469Z @given( 2025-05-07T20:33:08.2441808Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2442266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2442672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2443051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2443396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2443673Z ) 2025-05-07T20:33:08.2444011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2444538Z def test_silu_mul_quant( 2025-05-07T20:33:08.2444781Z self, 2025-05-07T20:33:08.2444971Z T: int, 2025-05-07T20:33:08.2445159Z D: int, 2025-05-07T20:33:08.2445367Z scale_ub: Optional[float], 2025-05-07T20:33:08.2445628Z contiguous: bool, 2025-05-07T20:33:08.2445849Z compiled: bool, 2025-05-07T20:33:08.2446069Z ) -> None: 2025-05-07T20:33:08.2446277Z torch.manual_seed(2025) 2025-05-07T20:33:08.2446503Z 2025-05-07T20:33:08.2446770Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2447101Z 2025-05-07T20:33:08.2447277Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2447566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2448191Z x = x_sign * x_clamp 2025-05-07T20:33:08.2448421Z x0 = x[:, :D] 2025-05-07T20:33:08.2448628Z x1 = x[:, D:] 2025-05-07T20:33:08.2448825Z 2025-05-07T20:33:08.2448991Z if contiguous: 2025-05-07T20:33:08.2449215Z x0 = x0.contiguous() 2025-05-07T20:33:08.2449465Z x1 = x1.contiguous() 2025-05-07T20:33:08.2449693Z 2025-05-07T20:33:08.2449874Z if scale_ub is not None: 2025-05-07T20:33:08.2450137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2450458Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2458065Z ) 2025-05-07T20:33:08.2458298Z else: 2025-05-07T20:33:08.2458526Z scale_ub_tensor = None 2025-05-07T20:33:08.2458796Z 2025-05-07T20:33:08.2459034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2459368Z op = silu_mul_quant 2025-05-07T20:33:08.2459631Z if compiled: 2025-05-07T20:33:08.2459890Z op = torch.compile(op) 2025-05-07T20:33:08.2460200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2460613Z 2025-05-07T20:33:08.2460811Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2460989Z 2025-05-07T20:33:08.2461095Z moe/activation_test.py:117: 2025-05-07T20:33:08.2461406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2461890Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2462175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2462746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2463315Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.2463973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2464667Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2465213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2465900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2466560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2467187Z kernel = self.compile( 2025-05-07T20:33:08.2467737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2468389Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2468798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2469036Z 2025-05-07T20:33:08.2469245Z self = 2025-05-07T20:33:08.2470334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2471736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc78d60>} 2025-05-07T20:33:08.2473076Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2474109Z context = 2025-05-07T20:33:08.2474414Z 2025-05-07T20:33:08.2474583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2475113Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2475584Z module_map=module_map) 2025-05-07T20:33:08.2476001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2476361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2476611Z E ^ 2025-05-07T20:33:08.2477080Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2477541Z 2025-05-07T20:33:08.2477956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2478463Z 2025-05-07T20:33:08.2478576Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2478981Z self=, 2025-05-07T20:33:08.2479380Z T=16384, 2025-05-07T20:33:08.2479577Z D=5120, 2025-05-07T20:33:08.2479764Z scale_ub=1200.0, 2025-05-07T20:33:08.2479993Z contiguous=False, 2025-05-07T20:33:08.2480223Z compiled=False, 2025-05-07T20:33:08.2480422Z ) 2025-05-07T20:33:08.2480747Z self = 2025-05-07T20:33:08.2481300Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.2481580Z 2025-05-07T20:33:08.2481675Z @given( 2025-05-07T20:33:08.2481948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2482308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2482616Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2482946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2483276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2483567Z ) 2025-05-07T20:33:08.2483911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2484503Z def test_silu_mul_quant( 2025-05-07T20:33:08.2484751Z self, 2025-05-07T20:33:08.2484954Z T: int, 2025-05-07T20:33:08.2485151Z D: int, 2025-05-07T20:33:08.2485376Z scale_ub: Optional[float], 2025-05-07T20:33:08.2485655Z contiguous: bool, 2025-05-07T20:33:08.2485885Z compiled: bool, 2025-05-07T20:33:08.2486113Z ) -> None: 2025-05-07T20:33:08.2486321Z torch.manual_seed(2025) 2025-05-07T20:33:08.2486570Z 2025-05-07T20:33:08.2486848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2487240Z 2025-05-07T20:33:08.2487435Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2487727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2488028Z x = x_sign * x_clamp 2025-05-07T20:33:08.2488268Z x0 = x[:, :D] 2025-05-07T20:33:08.2488482Z x1 = x[:, D:] 2025-05-07T20:33:08.2488684Z 2025-05-07T20:33:08.2488875Z if contiguous: 2025-05-07T20:33:08.2489107Z x0 = x0.contiguous() 2025-05-07T20:33:08.2489369Z x1 = x1.contiguous() 2025-05-07T20:33:08.2489606Z 2025-05-07T20:33:08.2489798Z if scale_ub is not None: 2025-05-07T20:33:08.2490076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2490405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2490716Z ) 2025-05-07T20:33:08.2490911Z else: 2025-05-07T20:33:08.2491115Z scale_ub_tensor = None 2025-05-07T20:33:08.2491373Z 2025-05-07T20:33:08.2491603Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2491903Z op = silu_mul_quant 2025-05-07T20:33:08.2492151Z if compiled: 2025-05-07T20:33:08.2492398Z op = torch.compile(op) 2025-05-07T20:33:08.2492686Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2492959Z 2025-05-07T20:33:08.2493148Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2493309Z 2025-05-07T20:33:08.2493414Z moe/activation_test.py:117: 2025-05-07T20:33:08.2493703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2494034Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2494368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2495049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.2495734Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2496268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2496949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2497601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2498134Z kernel = self.compile( 2025-05-07T20:33:08.2498720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2499368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2499767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2499998Z 2025-05-07T20:33:08.2500253Z self = 2025-05-07T20:33:08.2501330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2502726Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc79c60>} 2025-05-07T20:33:08.2504059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2505107Z context = 2025-05-07T20:33:08.2505399Z 2025-05-07T20:33:08.2505574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2506093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2506555Z module_map=module_map) 2025-05-07T20:33:08.2506965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2507312Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2507563Z E ^ 2025-05-07T20:33:08.2508025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2508761Z 2025-05-07T20:33:08.2509187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.2509697Z 2025-05-07T20:33:08.2509806Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.2510361Z self=, 2025-05-07T20:33:08.2510768Z T=16384, 2025-05-07T20:33:08.2510965Z D=5120, 2025-05-07T20:33:08.2511152Z scale_ub=1200.0, 2025-05-07T20:33:08.2511378Z contiguous=True, 2025-05-07T20:33:08.2511601Z compiled=True, 2025-05-07T20:33:08.2511799Z ) 2025-05-07T20:33:08.2512147Z self = 2025-05-07T20:33:08.2512643Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.2512914Z 2025-05-07T20:33:08.2512997Z @given( 2025-05-07T20:33:08.2513219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.2513530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.2513839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.2514157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.2514484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.2514763Z ) 2025-05-07T20:33:08.2515182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.2515620Z def test_silu_mul_quant( 2025-05-07T20:33:08.2515859Z self, 2025-05-07T20:33:08.2516042Z T: int, 2025-05-07T20:33:08.2516238Z D: int, 2025-05-07T20:33:08.2516451Z scale_ub: Optional[float], 2025-05-07T20:33:08.2516709Z contiguous: bool, 2025-05-07T20:33:08.2516947Z compiled: bool, 2025-05-07T20:33:08.2517169Z ) -> None: 2025-05-07T20:33:08.2517373Z torch.manual_seed(2025) 2025-05-07T20:33:08.2517615Z 2025-05-07T20:33:08.2517879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.2518214Z 2025-05-07T20:33:08.2518393Z x_sign = torch.sign(x) 2025-05-07T20:33:08.2518683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.2518994Z x = x_sign * x_clamp 2025-05-07T20:33:08.2519224Z x0 = x[:, :D] 2025-05-07T20:33:08.2519440Z x1 = x[:, D:] 2025-05-07T20:33:08.2519643Z 2025-05-07T20:33:08.2519818Z if contiguous: 2025-05-07T20:33:08.2520112Z x0 = x0.contiguous() 2025-05-07T20:33:08.2520364Z x1 = x1.contiguous() 2025-05-07T20:33:08.2520588Z 2025-05-07T20:33:08.2520774Z if scale_ub is not None: 2025-05-07T20:33:08.2521104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.2521430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.2521738Z ) 2025-05-07T20:33:08.2521929Z else: 2025-05-07T20:33:08.2522131Z scale_ub_tensor = None 2025-05-07T20:33:08.2522382Z 2025-05-07T20:33:08.2522611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.2522924Z op = silu_mul_quant 2025-05-07T20:33:08.2523165Z if compiled: 2025-05-07T20:33:08.2523408Z op = torch.compile(op) 2025-05-07T20:33:08.2523701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2523971Z 2025-05-07T20:33:08.2524160Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.2524417Z 2025-05-07T20:33:08.2524522Z moe/activation_test.py:117: 2025-05-07T20:33:08.2524811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2525208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.2525495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.2526043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.2526595Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.2527255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.2527933Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.2528457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.2529134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.2529794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.2530322Z kernel = self.compile( 2025-05-07T20:33:08.2530861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.2531513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.2531908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.2532134Z 2025-05-07T20:33:08.2532339Z self = 2025-05-07T20:33:08.2533420Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.2535410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc7b380>} 2025-05-07T20:33:08.2536744Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.2537759Z context = 2025-05-07T20:33:08.2538055Z 2025-05-07T20:33:08.2538218Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.2538738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.2539203Z module_map=module_map) 2025-05-07T20:33:08.2539560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.2539908Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.2540175Z E ^ 2025-05-07T20:33:08.2540701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.2541153Z 2025-05-07T20:33:08.2541567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4099199Z 2025-05-07T20:33:08.4099804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4100439Z self=, 2025-05-07T20:33:08.4100966Z T=16384, 2025-05-07T20:33:08.4101210Z D=5120, 2025-05-07T20:33:08.4101448Z scale_ub=None, 2025-05-07T20:33:08.4101687Z contiguous=False, 2025-05-07T20:33:08.4101901Z compiled=True, 2025-05-07T20:33:08.4102097Z ) 2025-05-07T20:33:08.4102401Z self = 2025-05-07T20:33:08.4102919Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4103199Z 2025-05-07T20:33:08.4103271Z @given( 2025-05-07T20:33:08.4103499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4103805Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4104123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4104789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4105123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4105419Z ) 2025-05-07T20:33:08.4105779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4106235Z def test_silu_mul_quant( 2025-05-07T20:33:08.4106487Z self, 2025-05-07T20:33:08.4106687Z T: int, 2025-05-07T20:33:08.4106885Z D: int, 2025-05-07T20:33:08.4107108Z scale_ub: Optional[float], 2025-05-07T20:33:08.4107388Z contiguous: bool, 2025-05-07T20:33:08.4107635Z compiled: bool, 2025-05-07T20:33:08.4107880Z ) -> None: 2025-05-07T20:33:08.4108100Z torch.manual_seed(2025) 2025-05-07T20:33:08.4108670Z 2025-05-07T20:33:08.4108946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4109301Z 2025-05-07T20:33:08.4109502Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4109802Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4110127Z x = x_sign * x_clamp 2025-05-07T20:33:08.4110377Z x0 = x[:, :D] 2025-05-07T20:33:08.4110592Z x1 = x[:, D:] 2025-05-07T20:33:08.4110808Z 2025-05-07T20:33:08.4111002Z if contiguous: 2025-05-07T20:33:08.4111235Z x0 = x0.contiguous() 2025-05-07T20:33:08.4111504Z x1 = x1.contiguous() 2025-05-07T20:33:08.4111769Z 2025-05-07T20:33:08.4111991Z if scale_ub is not None: 2025-05-07T20:33:08.4112283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4112635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4113051Z ) 2025-05-07T20:33:08.4113229Z else: 2025-05-07T20:33:08.4113435Z scale_ub_tensor = None 2025-05-07T20:33:08.4113671Z 2025-05-07T20:33:08.4113883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4114190Z op = silu_mul_quant 2025-05-07T20:33:08.4114459Z if compiled: 2025-05-07T20:33:08.4114698Z op = torch.compile(op) 2025-05-07T20:33:08.4114977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4115237Z 2025-05-07T20:33:08.4115419Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4115579Z 2025-05-07T20:33:08.4115673Z moe/activation_test.py:117: 2025-05-07T20:33:08.4115960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4116279Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4116544Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4117101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4117744Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4118393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4119065Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4119669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4120341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4120998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4121511Z kernel = self.compile( 2025-05-07T20:33:08.4122049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4122699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4123084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4123316Z 2025-05-07T20:33:08.4123516Z self = 2025-05-07T20:33:08.4124763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4126149Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af810180>} 2025-05-07T20:33:08.4127476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4128482Z context = 2025-05-07T20:33:08.4128773Z 2025-05-07T20:33:08.4128936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4129450Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4129915Z module_map=module_map) 2025-05-07T20:33:08.4130264Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4130607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4130853Z E ^ 2025-05-07T20:33:08.4131300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4131747Z 2025-05-07T20:33:08.4132154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.4132666Z 2025-05-07T20:33:08.4132765Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.4133402Z self=, 2025-05-07T20:33:08.4133790Z T=2048, 2025-05-07T20:33:08.4133965Z D=5120, 2025-05-07T20:33:08.4134150Z scale_ub=None, 2025-05-07T20:33:08.4134379Z contiguous=False, 2025-05-07T20:33:08.4134609Z compiled=True, 2025-05-07T20:33:08.4134801Z ) 2025-05-07T20:33:08.4135103Z self = 2025-05-07T20:33:08.4135587Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:08.4135849Z 2025-05-07T20:33:08.4135924Z @given( 2025-05-07T20:33:08.4136137Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.4136441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.4136740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.4137055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.4137371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.4137640Z ) 2025-05-07T20:33:08.4138024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.4138444Z def test_silu_mul_quant( 2025-05-07T20:33:08.4138670Z self, 2025-05-07T20:33:08.4138853Z T: int, 2025-05-07T20:33:08.4139031Z D: int, 2025-05-07T20:33:08.4139282Z scale_ub: Optional[float], 2025-05-07T20:33:08.4139541Z contiguous: bool, 2025-05-07T20:33:08.4139761Z compiled: bool, 2025-05-07T20:33:08.4139975Z ) -> None: 2025-05-07T20:33:08.4140177Z torch.manual_seed(2025) 2025-05-07T20:33:08.4140399Z 2025-05-07T20:33:08.4140658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.4140989Z 2025-05-07T20:33:08.4141171Z x_sign = torch.sign(x) 2025-05-07T20:33:08.4141442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.4141751Z x = x_sign * x_clamp 2025-05-07T20:33:08.4142019Z x0 = x[:, :D] 2025-05-07T20:33:08.4142215Z x1 = x[:, D:] 2025-05-07T20:33:08.4142408Z 2025-05-07T20:33:08.4142584Z if contiguous: 2025-05-07T20:33:08.4142794Z x0 = x0.contiguous() 2025-05-07T20:33:08.4143041Z x1 = x1.contiguous() 2025-05-07T20:33:08.4143319Z 2025-05-07T20:33:08.4143495Z if scale_ub is not None: 2025-05-07T20:33:08.4143755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.4144081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.4144371Z ) 2025-05-07T20:33:08.4144557Z else: 2025-05-07T20:33:08.4144760Z scale_ub_tensor = None 2025-05-07T20:33:08.4144991Z 2025-05-07T20:33:08.4145213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.4145520Z op = silu_mul_quant 2025-05-07T20:33:08.4145764Z if compiled: 2025-05-07T20:33:08.4145997Z op = torch.compile(op) 2025-05-07T20:33:08.4146290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4146552Z 2025-05-07T20:33:08.4146730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.4146901Z 2025-05-07T20:33:08.4146994Z moe/activation_test.py:117: 2025-05-07T20:33:08.4147289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4147617Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.4147894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.4148441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.4148988Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.4149630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.4150306Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.4150916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.4151582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.4152242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.4152768Z kernel = self.compile( 2025-05-07T20:33:08.4153305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.4153946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.4154336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.4154563Z 2025-05-07T20:33:08.4154773Z self = 2025-05-07T20:33:08.4155846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.4157238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af811440>} 2025-05-07T20:33:08.4158575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.4159629Z context = 2025-05-07T20:33:08.4159911Z 2025-05-07T20:33:08.4160079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.4160586Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.4161047Z module_map=module_map) 2025-05-07T20:33:08.4161408Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.4161752Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.4161993Z E ^ 2025-05-07T20:33:08.4162455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.4162947Z 2025-05-07T20:33:08.4163365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5787286Z 2025-05-07T20:33:08.5794477Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5795134Z self=, 2025-05-07T20:33:08.5795711Z T=2048, 2025-05-07T20:33:08.5795942Z D=5120, 2025-05-07T20:33:08.5796137Z scale_ub=1200.0, 2025-05-07T20:33:08.5796359Z contiguous=False, 2025-05-07T20:33:08.5796576Z compiled=True, 2025-05-07T20:33:08.5796782Z ) 2025-05-07T20:33:08.5797099Z self = 2025-05-07T20:33:08.5797601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.5797888Z 2025-05-07T20:33:08.5797958Z @given( 2025-05-07T20:33:08.5798182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5798479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5798786Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5799108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5799430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5799696Z ) 2025-05-07T20:33:08.5800034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5800463Z def test_silu_mul_quant( 2025-05-07T20:33:08.5800686Z self, 2025-05-07T20:33:08.5800869Z T: int, 2025-05-07T20:33:08.5801054Z D: int, 2025-05-07T20:33:08.5801255Z scale_ub: Optional[float], 2025-05-07T20:33:08.5801515Z contiguous: bool, 2025-05-07T20:33:08.5801970Z compiled: bool, 2025-05-07T20:33:08.5802212Z ) -> None: 2025-05-07T20:33:08.5802441Z torch.manual_seed(2025) 2025-05-07T20:33:08.5802673Z 2025-05-07T20:33:08.5802929Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5803260Z 2025-05-07T20:33:08.5803447Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5803722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5804021Z x = x_sign * x_clamp 2025-05-07T20:33:08.5804391Z x0 = x[:, :D] 2025-05-07T20:33:08.5804601Z x1 = x[:, D:] 2025-05-07T20:33:08.5804794Z 2025-05-07T20:33:08.5804964Z if contiguous: 2025-05-07T20:33:08.5805184Z x0 = x0.contiguous() 2025-05-07T20:33:08.5805423Z x1 = x1.contiguous() 2025-05-07T20:33:08.5805652Z 2025-05-07T20:33:08.5805833Z if scale_ub is not None: 2025-05-07T20:33:08.5806088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5806418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5806806Z ) 2025-05-07T20:33:08.5806988Z else: 2025-05-07T20:33:08.5807193Z scale_ub_tensor = None 2025-05-07T20:33:08.5807434Z 2025-05-07T20:33:08.5807649Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5808049Z op = silu_mul_quant 2025-05-07T20:33:08.5808545Z if compiled: 2025-05-07T20:33:08.5808781Z op = torch.compile(op) 2025-05-07T20:33:08.5809067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5809330Z 2025-05-07T20:33:08.5809508Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5809675Z 2025-05-07T20:33:08.5809770Z moe/activation_test.py:117: 2025-05-07T20:33:08.5810056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5810383Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5810648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5811204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5811756Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5812405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5813171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5813700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5814365Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5815013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5815529Z kernel = self.compile( 2025-05-07T20:33:08.5816059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5816695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5817083Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5817306Z 2025-05-07T20:33:08.5817514Z self = 2025-05-07T20:33:08.5818597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5819958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af812660>} 2025-05-07T20:33:08.5821352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5822422Z context = 2025-05-07T20:33:08.5822704Z 2025-05-07T20:33:08.5822870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5823375Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5823834Z module_map=module_map) 2025-05-07T20:33:08.5824187Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5824534Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5824778Z E ^ 2025-05-07T20:33:08.5825230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5825672Z 2025-05-07T20:33:08.5826089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5826591Z 2025-05-07T20:33:08.5826691Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.5827159Z self=, 2025-05-07T20:33:08.5827553Z T=4096, 2025-05-07T20:33:08.5827730Z D=5120, 2025-05-07T20:33:08.5827902Z scale_ub=1200.0, 2025-05-07T20:33:08.5828112Z contiguous=True, 2025-05-07T20:33:08.5828408Z compiled=True, 2025-05-07T20:33:08.5828590Z ) 2025-05-07T20:33:08.5828900Z self = 2025-05-07T20:33:08.5829379Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.5829640Z 2025-05-07T20:33:08.5829734Z @given( 2025-05-07T20:33:08.5829946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.5830243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.5830541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.5830850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.5831171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.5831445Z ) 2025-05-07T20:33:08.5831780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.5832199Z def test_silu_mul_quant( 2025-05-07T20:33:08.5832479Z self, 2025-05-07T20:33:08.5832665Z T: int, 2025-05-07T20:33:08.5832842Z D: int, 2025-05-07T20:33:08.5833050Z scale_ub: Optional[float], 2025-05-07T20:33:08.5833311Z contiguous: bool, 2025-05-07T20:33:08.5833533Z compiled: bool, 2025-05-07T20:33:08.5833744Z ) -> None: 2025-05-07T20:33:08.5833945Z torch.manual_seed(2025) 2025-05-07T20:33:08.5834169Z 2025-05-07T20:33:08.5834431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.5834764Z 2025-05-07T20:33:08.5834937Z x_sign = torch.sign(x) 2025-05-07T20:33:08.5835216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.5835515Z x = x_sign * x_clamp 2025-05-07T20:33:08.5835735Z x0 = x[:, :D] 2025-05-07T20:33:08.5835946Z x1 = x[:, D:] 2025-05-07T20:33:08.5836143Z 2025-05-07T20:33:08.5836312Z if contiguous: 2025-05-07T20:33:08.5836535Z x0 = x0.contiguous() 2025-05-07T20:33:08.5836785Z x1 = x1.contiguous() 2025-05-07T20:33:08.5837014Z 2025-05-07T20:33:08.5837187Z if scale_ub is not None: 2025-05-07T20:33:08.5837453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.5837777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.5838066Z ) 2025-05-07T20:33:08.5838250Z else: 2025-05-07T20:33:08.5838454Z scale_ub_tensor = None 2025-05-07T20:33:08.5838687Z 2025-05-07T20:33:08.5838911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.5839218Z op = silu_mul_quant 2025-05-07T20:33:08.5839453Z if compiled: 2025-05-07T20:33:08.5839743Z op = torch.compile(op) 2025-05-07T20:33:08.5840036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5840290Z 2025-05-07T20:33:08.5840468Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.5840626Z 2025-05-07T20:33:08.5840726Z moe/activation_test.py:117: 2025-05-07T20:33:08.5841014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5841331Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.5841600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.5842144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.5842679Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.5843324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.5844019Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.5844649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.5845357Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.5846010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.5846570Z kernel = self.compile( 2025-05-07T20:33:08.5847096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.5847746Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.5848141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.5848365Z 2025-05-07T20:33:08.5848573Z self = 2025-05-07T20:33:08.5849637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.5850993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af8139c0>} 2025-05-07T20:33:08.5852367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.5853375Z context = 2025-05-07T20:33:08.5853657Z 2025-05-07T20:33:08.5853822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.5854329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.5854789Z module_map=module_map) 2025-05-07T20:33:08.5855154Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.5855493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.5855739Z E ^ 2025-05-07T20:33:08.5856202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5856651Z 2025-05-07T20:33:08.5857071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7578671Z 2025-05-07T20:33:08.7579432Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7580216Z self=, 2025-05-07T20:33:08.7580963Z T=128, 2025-05-07T20:33:08.7581215Z D=5120, 2025-05-07T20:33:08.7581405Z scale_ub=1200.0, 2025-05-07T20:33:08.7581626Z contiguous=False, 2025-05-07T20:33:08.7581848Z compiled=True, 2025-05-07T20:33:08.7582075Z ) 2025-05-07T20:33:08.7582669Z self = 2025-05-07T20:33:08.7583175Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.7583446Z 2025-05-07T20:33:08.7583521Z @given( 2025-05-07T20:33:08.7583753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7584082Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7584379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7584708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7585034Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7585307Z ) 2025-05-07T20:33:08.7585654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7586089Z def test_silu_mul_quant( 2025-05-07T20:33:08.7586328Z self, 2025-05-07T20:33:08.7586517Z T: int, 2025-05-07T20:33:08.7586730Z D: int, 2025-05-07T20:33:08.7586949Z scale_ub: Optional[float], 2025-05-07T20:33:08.7587220Z contiguous: bool, 2025-05-07T20:33:08.7587549Z compiled: bool, 2025-05-07T20:33:08.7587774Z ) -> None: 2025-05-07T20:33:08.7587990Z torch.manual_seed(2025) 2025-05-07T20:33:08.7588234Z 2025-05-07T20:33:08.7588500Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7588921Z 2025-05-07T20:33:08.7589113Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7589405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7589705Z x = x_sign * x_clamp 2025-05-07T20:33:08.7589952Z x0 = x[:, :D] 2025-05-07T20:33:08.7590171Z x1 = x[:, D:] 2025-05-07T20:33:08.7590370Z 2025-05-07T20:33:08.7590559Z if contiguous: 2025-05-07T20:33:08.7590792Z x0 = x0.contiguous() 2025-05-07T20:33:08.7591045Z x1 = x1.contiguous() 2025-05-07T20:33:08.7591285Z 2025-05-07T20:33:08.7591471Z if scale_ub is not None: 2025-05-07T20:33:08.7591730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7592057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7592357Z ) 2025-05-07T20:33:08.7592531Z else: 2025-05-07T20:33:08.7592732Z scale_ub_tensor = None 2025-05-07T20:33:08.7593067Z 2025-05-07T20:33:08.7593284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7593587Z op = silu_mul_quant 2025-05-07T20:33:08.7593825Z if compiled: 2025-05-07T20:33:08.7594061Z op = torch.compile(op) 2025-05-07T20:33:08.7594344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7594605Z 2025-05-07T20:33:08.7594789Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7594948Z 2025-05-07T20:33:08.7595046Z moe/activation_test.py:117: 2025-05-07T20:33:08.7595332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7595657Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7595923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7596483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7597029Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7597697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7598370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7598894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7599561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7600204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7600722Z kernel = self.compile( 2025-05-07T20:33:08.7601305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7601955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7602341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7602574Z 2025-05-07T20:33:08.7602776Z self = 2025-05-07T20:33:08.7603841Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7605396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43cfe0>} 2025-05-07T20:33:08.7606718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7607779Z context = 2025-05-07T20:33:08.7608075Z 2025-05-07T20:33:08.7608498Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7609085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7609535Z module_map=module_map) 2025-05-07T20:33:08.7609892Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7610239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7610486Z E ^ 2025-05-07T20:33:08.7610934Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7611380Z 2025-05-07T20:33:08.7611796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7612298Z 2025-05-07T20:33:08.7612405Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7612808Z self=, 2025-05-07T20:33:08.7613198Z T=16384, 2025-05-07T20:33:08.7613452Z D=7168, 2025-05-07T20:33:08.7613637Z scale_ub=1200.0, 2025-05-07T20:33:08.7613843Z contiguous=True, 2025-05-07T20:33:08.7614055Z compiled=True, 2025-05-07T20:33:08.7614245Z ) 2025-05-07T20:33:08.7614549Z self = 2025-05-07T20:33:08.7615033Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:08.7615301Z 2025-05-07T20:33:08.7615382Z @given( 2025-05-07T20:33:08.7615597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7615900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7616195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7616513Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7616837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7617111Z ) 2025-05-07T20:33:08.7617454Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7617882Z def test_silu_mul_quant( 2025-05-07T20:33:08.7618120Z self, 2025-05-07T20:33:08.7618313Z T: int, 2025-05-07T20:33:08.7618494Z D: int, 2025-05-07T20:33:08.7618707Z scale_ub: Optional[float], 2025-05-07T20:33:08.7618974Z contiguous: bool, 2025-05-07T20:33:08.7619200Z compiled: bool, 2025-05-07T20:33:08.7619412Z ) -> None: 2025-05-07T20:33:08.7619620Z torch.manual_seed(2025) 2025-05-07T20:33:08.7619845Z 2025-05-07T20:33:08.7620110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7620438Z 2025-05-07T20:33:08.7620618Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7620976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7621279Z x = x_sign * x_clamp 2025-05-07T20:33:08.7621503Z x0 = x[:, :D] 2025-05-07T20:33:08.7621714Z x1 = x[:, D:] 2025-05-07T20:33:08.7621914Z 2025-05-07T20:33:08.7622094Z if contiguous: 2025-05-07T20:33:08.7622317Z x0 = x0.contiguous() 2025-05-07T20:33:08.7622570Z x1 = x1.contiguous() 2025-05-07T20:33:08.7622801Z 2025-05-07T20:33:08.7622978Z if scale_ub is not None: 2025-05-07T20:33:08.7623245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7623572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7623862Z ) 2025-05-07T20:33:08.7624047Z else: 2025-05-07T20:33:08.7624250Z scale_ub_tensor = None 2025-05-07T20:33:08.7624484Z 2025-05-07T20:33:08.7624710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7625019Z op = silu_mul_quant 2025-05-07T20:33:08.7625261Z if compiled: 2025-05-07T20:33:08.7625573Z op = torch.compile(op) 2025-05-07T20:33:08.7625871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7626129Z 2025-05-07T20:33:08.7626314Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7626485Z 2025-05-07T20:33:08.7626626Z moe/activation_test.py:117: 2025-05-07T20:33:08.7626919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7627235Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7627513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7628059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7628597Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7629245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7629928Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7630460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7631126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7631829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7632349Z kernel = self.compile( 2025-05-07T20:33:08.7632875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7633525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7633915Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7634140Z 2025-05-07T20:33:08.7634349Z self = 2025-05-07T20:33:08.7635416Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7636776Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43de40>} 2025-05-07T20:33:08.7638111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7639125Z context = 2025-05-07T20:33:08.7639410Z 2025-05-07T20:33:08.7639579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7640089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7640597Z module_map=module_map) 2025-05-07T20:33:08.7640956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7641294Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7641543Z E ^ 2025-05-07T20:33:08.7641999Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7642449Z 2025-05-07T20:33:08.7642867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8815825Z 2025-05-07T20:33:08.8816573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8817350Z self=, 2025-05-07T20:33:08.8817936Z T=16384, 2025-05-07T20:33:08.8818134Z D=5120, 2025-05-07T20:33:08.8818319Z scale_ub=1200.0, 2025-05-07T20:33:08.8818528Z contiguous=True, 2025-05-07T20:33:08.8818746Z compiled=False, 2025-05-07T20:33:08.8818972Z ) 2025-05-07T20:33:08.8819569Z self = 2025-05-07T20:33:08.8820073Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:08.8820356Z 2025-05-07T20:33:08.8820438Z @given( 2025-05-07T20:33:08.8820741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8821038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8821341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8821662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8821980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8822261Z ) 2025-05-07T20:33:08.8822607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8823041Z def test_silu_mul_quant( 2025-05-07T20:33:08.8823279Z self, 2025-05-07T20:33:08.8823468Z T: int, 2025-05-07T20:33:08.8823656Z D: int, 2025-05-07T20:33:08.8823876Z scale_ub: Optional[float], 2025-05-07T20:33:08.8824148Z contiguous: bool, 2025-05-07T20:33:08.8824381Z compiled: bool, 2025-05-07T20:33:08.8824607Z ) -> None: 2025-05-07T20:33:08.8824819Z torch.manual_seed(2025) 2025-05-07T20:33:08.8825143Z 2025-05-07T20:33:08.8825406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8825744Z 2025-05-07T20:33:08.8825930Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8826212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8826520Z x = x_sign * x_clamp 2025-05-07T20:33:08.8826754Z x0 = x[:, :D] 2025-05-07T20:33:08.8826961Z x1 = x[:, D:] 2025-05-07T20:33:08.8827161Z 2025-05-07T20:33:08.8827340Z if contiguous: 2025-05-07T20:33:08.8827560Z x0 = x0.contiguous() 2025-05-07T20:33:08.8827814Z x1 = x1.contiguous() 2025-05-07T20:33:08.8828049Z 2025-05-07T20:33:08.8828229Z if scale_ub is not None: 2025-05-07T20:33:08.8828502Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8828837Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8829142Z ) 2025-05-07T20:33:08.8836496Z else: 2025-05-07T20:33:08.8836741Z scale_ub_tensor = None 2025-05-07T20:33:08.8837008Z 2025-05-07T20:33:08.8837252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8837579Z op = silu_mul_quant 2025-05-07T20:33:08.8837846Z if compiled: 2025-05-07T20:33:08.8838106Z op = torch.compile(op) 2025-05-07T20:33:08.8838434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8838742Z 2025-05-07T20:33:08.8838948Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8839119Z 2025-05-07T20:33:08.8839232Z moe/activation_test.py:117: 2025-05-07T20:33:08.8839653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8839997Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8840290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8840979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.8855577Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.8884775Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.8886299Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.0549201Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.0550661Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.0582334Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.0583912Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:09.1897602Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant, reached via torch/_dynamo/eval_frame.py:678): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
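For context on what the failing op computes: silu_mul_quant fuses a SiLU-gated multiply, y = SiLU(x0) * x1, with quantization to FP8 (Triton's fp8e4nv corresponds to torch.float8_e4m3fn). A rough eager-mode sketch of those semantics follows; it assumes a single per-tensor scale capped by scale_ub, which is a guess, since the fused kernel's exact scaling scheme (for example row-wise scales) is not visible in this log.

    from typing import Optional, Tuple

    import torch


    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Gated activation in fp32 for accuracy: y = SiLU(x0) * x1.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-tensor absmax, optionally capped by scale_ub (assumed semantics).
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float().reshape(()))
        # Map the observed range onto the representable FP8 E4M3 range.
        scale = amax.clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale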
2025-05-07T20:33:09.1906280Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.1915436Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.1917453Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1919440Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.1919759Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:09.1928615Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.1930611Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1932627Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.1932983Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.1940719Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.1942757Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.1944740Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:09.3187599Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:09.3197989Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3200004Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.3201998Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:09.3202318Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:09.3211315Z >       x_sign = torch.sign(x)
2025-05-07T20:33:09.3213282Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:09.3215282Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:09.3215598Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:09.3244938Z E       triton.compiler.errors.CompilationError (in _fbgemm_silu_mul_quant): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
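The OutOfMemoryError examples above are a knock-on effect rather than an independent bug: each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input (for T=16384, D=7168 that is 16384 * 14336 * 2 bytes = 448.00 MiB, exactly the failed allocation reported above), and with roughly 22 GiB of the A10G's 22.07 GiB already held from earlier examples, even 40 to 448 MiB requests fail at torch.randn / torch.sign / torch.clamp. A cleanup along these lines releases cached blocks between examples; this is a sketch, and its placement inside the test body is hypothetical. The allocator's own hint, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, can additionally be exported in the job environment to reduce fragmentation.

    import gc

    import torch


    def release_cuda_memory() -> None:
        # Drop dead Python references left by the previous example, then
        # return the caching allocator's free blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()


    # Sketch of use inside the Hypothesis-driven test body. unittest's
    # tearDown runs once per test *method*, while Hypothesis replays the
    # body many times within that single call, so the cleanup has to live
    # in the body itself:
    #
    #     try:
    #         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    #         ...  # rest of test_silu_mul_quant
    #     finally:
    #         release_cuda_memory()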
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.3233093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.3233786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.3234433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.3235007Z kernel = self.compile( 2025-05-07T20:33:09.3235540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.3236179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.3236577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3236807Z 2025-05-07T20:33:09.3237017Z self = 2025-05-07T20:33:09.3238094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.3239461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0662a0>} 2025-05-07T20:33:09.3240791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.3241813Z context = 2025-05-07T20:33:09.3242109Z 2025-05-07T20:33:09.3242271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.3242784Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.3243235Z module_map=module_map) 2025-05-07T20:33:09.3243645Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.3244177Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.3244479Z E ^ 2025-05-07T20:33:09.3244938Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.3245396Z 2025-05-07T20:33:09.3245807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.3246311Z 2025-05-07T20:33:09.3246415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.3246812Z self=, 2025-05-07T20:33:09.3247209Z T=128, 2025-05-07T20:33:09.3247393Z D=5120, 2025-05-07T20:33:09.3247570Z scale_ub=None, 2025-05-07T20:33:09.3247780Z contiguous=True, 2025-05-07T20:33:09.3248000Z compiled=False, 2025-05-07T20:33:09.3248188Z ) 2025-05-07T20:33:09.3248499Z self = 2025-05-07T20:33:09.3249041Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.3249304Z 2025-05-07T20:33:09.3249381Z @given( 2025-05-07T20:33:09.3249603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.3249954Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.3250251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.3250566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.3250892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.3251169Z ) 2025-05-07T20:33:09.3251503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.3251939Z def test_silu_mul_quant( 2025-05-07T20:33:09.3252201Z self, 2025-05-07T20:33:09.3252414Z T: int, 2025-05-07T20:33:09.3252602Z D: int, 2025-05-07T20:33:09.3252812Z scale_ub: Optional[float], 2025-05-07T20:33:09.3253077Z contiguous: bool, 2025-05-07T20:33:09.3253302Z compiled: bool, 2025-05-07T20:33:09.3253515Z ) -> None: 2025-05-07T20:33:09.3253726Z torch.manual_seed(2025) 2025-05-07T20:33:09.3253958Z 2025-05-07T20:33:09.3254229Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.3254616Z 2025-05-07T20:33:09.3254792Z x_sign = torch.sign(x) 2025-05-07T20:33:09.3255075Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.3255379Z x = x_sign * x_clamp 2025-05-07T20:33:09.3255606Z x0 = x[:, :D] 2025-05-07T20:33:09.3255819Z x1 = x[:, D:] 2025-05-07T20:33:09.3256020Z 2025-05-07T20:33:09.3256195Z if contiguous: 2025-05-07T20:33:09.3256429Z x0 = x0.contiguous() 2025-05-07T20:33:09.3256690Z x1 = x1.contiguous() 2025-05-07T20:33:09.3256918Z 2025-05-07T20:33:09.3257109Z if scale_ub is not None: 2025-05-07T20:33:09.3257385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.3257727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.3258030Z ) 2025-05-07T20:33:09.3258225Z else: 2025-05-07T20:33:09.3258436Z scale_ub_tensor = None 2025-05-07T20:33:09.3258678Z 2025-05-07T20:33:09.3258910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.3259216Z op = silu_mul_quant 2025-05-07T20:33:09.3259454Z if compiled: 2025-05-07T20:33:09.3259695Z op = torch.compile(op) 2025-05-07T20:33:09.3259985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3260245Z 2025-05-07T20:33:09.3260441Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.3260606Z 2025-05-07T20:33:09.3260714Z moe/activation_test.py:117: 2025-05-07T20:33:09.3261006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3261344Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.3261704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3262401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.3263079Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.3263618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.3264303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.3264956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.3265491Z kernel = self.compile( 2025-05-07T20:33:09.3266033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.3266692Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.3267089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3267325Z 2025-05-07T20:33:09.3267578Z self = 2025-05-07T20:33:09.3268659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.3270068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0671a0>} 2025-05-07T20:33:09.3271412Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.3272428Z context = 2025-05-07T20:33:09.3272721Z 2025-05-07T20:33:09.3272890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.3273411Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.3273877Z module_map=module_map) 2025-05-07T20:33:09.3274287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.3274641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.3274909Z E ^ 2025-05-07T20:33:09.3275408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.3275864Z 2025-05-07T20:33:09.3276282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.4413440Z 2025-05-07T20:33:09.4414412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4415199Z self=, 2025-05-07T20:33:09.4415872Z T=128, 2025-05-07T20:33:09.4416163Z D=7168, 2025-05-07T20:33:09.4416445Z scale_ub=None, 2025-05-07T20:33:09.4416771Z contiguous=True, 2025-05-07T20:33:09.4417107Z compiled=False, 2025-05-07T20:33:09.4417415Z ) 2025-05-07T20:33:09.4417914Z self = 2025-05-07T20:33:09.4418661Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.4419122Z 2025-05-07T20:33:09.4419232Z @given( 2025-05-07T20:33:09.4419578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4420074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4420561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4421079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4421597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4422037Z ) 2025-05-07T20:33:09.4422971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4423709Z def test_silu_mul_quant( 2025-05-07T20:33:09.4424083Z self, 2025-05-07T20:33:09.4424366Z T: int, 2025-05-07T20:33:09.4424663Z D: int, 2025-05-07T20:33:09.4425006Z scale_ub: Optional[float], 2025-05-07T20:33:09.4425437Z contiguous: bool, 2025-05-07T20:33:09.4425825Z compiled: bool, 2025-05-07T20:33:09.4426173Z ) -> None: 2025-05-07T20:33:09.4426493Z torch.manual_seed(2025) 2025-05-07T20:33:09.4426876Z 2025-05-07T20:33:09.4427295Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4427850Z 2025-05-07T20:33:09.4428135Z x_sign = torch.sign(x) 2025-05-07T20:33:09.4428588Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.4429076Z x = x_sign * x_clamp 2025-05-07T20:33:09.4429444Z x0 = x[:, :D] 2025-05-07T20:33:09.4429766Z x1 = x[:, D:] 2025-05-07T20:33:09.4430086Z 2025-05-07T20:33:09.4430355Z if contiguous: 2025-05-07T20:33:09.4430851Z x0 = x0.contiguous() 2025-05-07T20:33:09.4431264Z x1 = x1.contiguous() 2025-05-07T20:33:09.4431630Z 2025-05-07T20:33:09.4431926Z if scale_ub is not None: 2025-05-07T20:33:09.4432520Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.4433035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.4433536Z ) 2025-05-07T20:33:09.4433820Z else: 2025-05-07T20:33:09.4443934Z scale_ub_tensor = None 2025-05-07T20:33:09.4444555Z 2025-05-07T20:33:09.4444933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.4445450Z op = silu_mul_quant 2025-05-07T20:33:09.4445843Z if compiled: 2025-05-07T20:33:09.4446191Z op = torch.compile(op) 2025-05-07T20:33:09.4446628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4447072Z 2025-05-07T20:33:09.4447358Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.4447620Z 2025-05-07T20:33:09.4447762Z moe/activation_test.py:117: 2025-05-07T20:33:09.4448187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4448889Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.4449362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4450548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.4451783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.4452762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.4453955Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.4455004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.4455841Z kernel = self.compile( 2025-05-07T20:33:09.4456754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.4457786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.4458426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4458799Z 2025-05-07T20:33:09.4459133Z self = 2025-05-07T20:33:09.4460935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.4463246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb0040>} 2025-05-07T20:33:09.4465524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.4467346Z context = 2025-05-07T20:33:09.4467874Z 2025-05-07T20:33:09.4468155Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.4469072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.4469903Z module_map=module_map) 2025-05-07T20:33:09.4470521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.4471111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.4471545Z E ^ 2025-05-07T20:33:09.4472359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.4473197Z 2025-05-07T20:33:09.4474085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.4475053Z 2025-05-07T20:33:09.4475223Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4475947Z self=, 2025-05-07T20:33:09.4476691Z T=2048, 2025-05-07T20:33:09.4476972Z D=7168, 2025-05-07T20:33:09.4477267Z scale_ub=1200.0, 2025-05-07T20:33:09.4477610Z contiguous=True, 2025-05-07T20:33:09.4477944Z compiled=False, 2025-05-07T20:33:09.4479754Z ) 2025-05-07T20:33:09.4480267Z self = 2025-05-07T20:33:09.4481110Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.4481609Z 2025-05-07T20:33:09.4481729Z @given( 2025-05-07T20:33:09.4482108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4482681Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4483217Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4483786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4484477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4485042Z ) 2025-05-07T20:33:09.4485653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4486442Z def test_silu_mul_quant( 2025-05-07T20:33:09.4486834Z self, 2025-05-07T20:33:09.4487136Z T: int, 2025-05-07T20:33:09.4487455Z D: int, 2025-05-07T20:33:09.4487806Z scale_ub: Optional[float], 2025-05-07T20:33:09.4488250Z contiguous: bool, 2025-05-07T20:33:09.4488635Z compiled: bool, 2025-05-07T20:33:09.4488974Z ) -> None: 2025-05-07T20:33:09.4489315Z torch.manual_seed(2025) 2025-05-07T20:33:09.4489713Z 2025-05-07T20:33:09.4490166Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4494012Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.4497536Z 2025-05-07T20:33:09.4497735Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.4498096Z 2025-05-07T20:33:09.4498270Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.4498974Z self=, 2025-05-07T20:33:09.4499685Z T=1, 2025-05-07T20:33:09.4499995Z D=5120, 2025-05-07T20:33:09.4500375Z scale_ub=1200.0, 2025-05-07T20:33:09.4500746Z contiguous=True, 2025-05-07T20:33:09.4501109Z compiled=False, 2025-05-07T20:33:09.4501436Z ) 2025-05-07T20:33:09.4501978Z self = 2025-05-07T20:33:09.4502827Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.4503296Z 2025-05-07T20:33:09.4503426Z @given( 2025-05-07T20:33:09.4503782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.4504318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.4504854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.4505421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.4505991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.4506486Z ) 2025-05-07T20:33:09.4507081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.4507883Z def test_silu_mul_quant( 2025-05-07T20:33:09.4508666Z self, 2025-05-07T20:33:09.4508993Z T: int, 2025-05-07T20:33:09.4509458Z D: int, 2025-05-07T20:33:09.4509832Z scale_ub: Optional[float], 2025-05-07T20:33:09.4510279Z contiguous: bool, 2025-05-07T20:33:09.4510690Z compiled: bool, 2025-05-07T20:33:09.4511179Z ) -> None: 2025-05-07T20:33:09.4511529Z torch.manual_seed(2025) 2025-05-07T20:33:09.4511870Z 2025-05-07T20:33:09.4512225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.4512646Z 2025-05-07T20:33:09.4512884Z x_sign = torch.sign(x) 2025-05-07T20:33:09.4513250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.4513635Z x = x_sign * x_clamp 2025-05-07T20:33:09.4513938Z x0 = x[:, :D] 2025-05-07T20:33:09.4514216Z x1 = x[:, D:] 2025-05-07T20:33:09.4514470Z 2025-05-07T20:33:09.4514687Z if contiguous: 2025-05-07T20:33:09.4515011Z x0 = x0.contiguous() 2025-05-07T20:33:09.4515387Z x1 = x1.contiguous() 2025-05-07T20:33:09.4515743Z 2025-05-07T20:33:09.4516024Z if scale_ub is not None: 2025-05-07T20:33:09.4516442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.4516932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.4517513Z ) 2025-05-07T20:33:09.4517799Z else: 2025-05-07T20:33:09.4518075Z scale_ub_tensor = None 2025-05-07T20:33:09.4518418Z 2025-05-07T20:33:09.4518749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.4519197Z op = silu_mul_quant 2025-05-07T20:33:09.4519537Z if compiled: 2025-05-07T20:33:09.4519886Z op = torch.compile(op) 2025-05-07T20:33:09.4520302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4520685Z 2025-05-07T20:33:09.4520969Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.4521208Z 2025-05-07T20:33:09.4521372Z moe/activation_test.py:117: 2025-05-07T20:33:09.4521795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4522277Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.4522693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.4523700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.4524814Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.4525586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.4526581Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.4527526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.4528305Z kernel = self.compile( 2025-05-07T20:33:09.4529207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.4530161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.4530725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.4531069Z 2025-05-07T20:33:09.4531371Z self = 2025-05-07T20:33:09.4532959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.4534967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb1580>} 2025-05-07T20:33:09.4536983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.4538466Z context = 2025-05-07T20:33:09.4538879Z 2025-05-07T20:33:09.4539118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.4539890Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.4540555Z module_map=module_map) 2025-05-07T20:33:09.4541094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.4541614Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.4541985Z E ^ 2025-05-07T20:33:09.4542706Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.4543507Z 2025-05-07T20:33:09.4544192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.5358523Z 2025-05-07T20:33:09.5359394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5360510Z self=, 2025-05-07T20:33:09.5361510Z T=2048, 2025-05-07T20:33:09.5362145Z D=5120, 2025-05-07T20:33:09.5362353Z scale_ub=None, 2025-05-07T20:33:09.5362556Z contiguous=True, 2025-05-07T20:33:09.5362775Z compiled=False, 2025-05-07T20:33:09.5362972Z ) 2025-05-07T20:33:09.5363282Z self = 2025-05-07T20:33:09.5363773Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5364039Z 2025-05-07T20:33:09.5364115Z @given( 2025-05-07T20:33:09.5364445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5364746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5365045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5365373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5365694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5365972Z ) 2025-05-07T20:33:09.5366315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5366748Z def test_silu_mul_quant( 2025-05-07T20:33:09.5366987Z self, 2025-05-07T20:33:09.5367179Z T: int, 2025-05-07T20:33:09.5367364Z D: int, 2025-05-07T20:33:09.5367578Z scale_ub: Optional[float], 2025-05-07T20:33:09.5367845Z contiguous: bool, 2025-05-07T20:33:09.5368081Z compiled: bool, 2025-05-07T20:33:09.5368302Z ) -> None: 2025-05-07T20:33:09.5368519Z torch.manual_seed(2025) 2025-05-07T20:33:09.5368755Z 2025-05-07T20:33:09.5369021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5369352Z 2025-05-07T20:33:09.5369536Z > x_sign = torch.sign(x) 2025-05-07T20:33:09.5371562Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
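The CompilationError repeated above is a hardware limitation rather than a bug in the kernel launch: Triton lowers fp8e4nv (torch.float8_e4m3fn) only on GPUs with compute capability 8.9 or newer, and the A10G on this g5.4xlarge runner reports 8.6, which is why only ('fp8e4b15', 'fp8e5') are offered. A hedged sketch of a capability guard (supports_fp8e4nv is a hypothetical helper, not part of this test):

    # Sketch (hypothetical helper): gate fp8e4nv paths on compute capability.
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv needs sm_89+ (Ada/Hopper); the A10G here is sm_86, so the
        # Triton front end rejects the dtype before any code is generated.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Wrapping the test in unittest.skipIf(not supports_fp8e4nv(), ...) would let it skip cleanly on this runner instead of recording the failure.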
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5373431Z 2025-05-07T20:33:09.5373546Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:09.5373755Z 2025-05-07T20:33:09.5373863Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5374262Z self=, 2025-05-07T20:33:09.5374662Z T=16384, 2025-05-07T20:33:09.5374851Z D=5120, 2025-05-07T20:33:09.5375031Z scale_ub=None, 2025-05-07T20:33:09.5375249Z contiguous=True, 2025-05-07T20:33:09.5375468Z compiled=False, 2025-05-07T20:33:09.5375751Z ) 2025-05-07T20:33:09.5376090Z self = 2025-05-07T20:33:09.5376571Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5376917Z 2025-05-07T20:33:09.5376989Z @given( 2025-05-07T20:33:09.5377213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5377515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5377804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5378130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5378450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5378719Z ) 2025-05-07T20:33:09.5379058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5379488Z def test_silu_mul_quant( 2025-05-07T20:33:09.5379718Z self, 2025-05-07T20:33:09.5379906Z T: int, 2025-05-07T20:33:09.5380098Z D: int, 2025-05-07T20:33:09.5380304Z scale_ub: Optional[float], 2025-05-07T20:33:09.5380570Z contiguous: bool, 2025-05-07T20:33:09.5380805Z compiled: bool, 2025-05-07T20:33:09.5381072Z ) -> None: 2025-05-07T20:33:09.5381278Z torch.manual_seed(2025) 2025-05-07T20:33:09.5381512Z 2025-05-07T20:33:09.5381784Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5383853Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5385706Z 2025-05-07T20:33:09.5385819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5386033Z 2025-05-07T20:33:09.5386129Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5386540Z self=, 2025-05-07T20:33:09.5386933Z T=4096, 2025-05-07T20:33:09.5387107Z D=5120, 2025-05-07T20:33:09.5387296Z scale_ub=None, 2025-05-07T20:33:09.5387505Z contiguous=True, 2025-05-07T20:33:09.5387720Z compiled=False, 2025-05-07T20:33:09.5387920Z ) 2025-05-07T20:33:09.5388232Z self = 2025-05-07T20:33:09.5388714Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.5388987Z 2025-05-07T20:33:09.5389063Z @given( 2025-05-07T20:33:09.5389346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5389647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5389952Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5390281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5390616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5390904Z ) 2025-05-07T20:33:09.5391252Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5391694Z def test_silu_mul_quant( 2025-05-07T20:33:09.5391938Z self, 2025-05-07T20:33:09.5392132Z T: int, 2025-05-07T20:33:09.5392331Z D: int, 2025-05-07T20:33:09.5392541Z scale_ub: Optional[float], 2025-05-07T20:33:09.5392812Z contiguous: bool, 2025-05-07T20:33:09.5393053Z compiled: bool, 2025-05-07T20:33:09.5393268Z ) -> None: 2025-05-07T20:33:09.5393490Z torch.manual_seed(2025) 2025-05-07T20:33:09.5393740Z 2025-05-07T20:33:09.5394008Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5396077Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5397956Z 2025-05-07T20:33:09.5398068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5398279Z 2025-05-07T20:33:09.5398375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5398780Z self=, 2025-05-07T20:33:09.5399163Z T=2048, 2025-05-07T20:33:09.5399343Z D=5120, 2025-05-07T20:33:09.5399524Z scale_ub=None, 2025-05-07T20:33:09.5399722Z contiguous=False, 2025-05-07T20:33:09.5399940Z compiled=False, 2025-05-07T20:33:09.5400135Z ) 2025-05-07T20:33:09.5400433Z self = 2025-05-07T20:33:09.5400968Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.5401239Z 2025-05-07T20:33:09.5401311Z @given( 2025-05-07T20:33:09.5401532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5401826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5402128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5402497Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5402808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5403081Z ) 2025-05-07T20:33:09.5403420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5403845Z def test_silu_mul_quant( 2025-05-07T20:33:09.5404076Z self, 2025-05-07T20:33:09.5404326Z T: int, 2025-05-07T20:33:09.5404507Z D: int, 2025-05-07T20:33:09.5404721Z scale_ub: Optional[float], 2025-05-07T20:33:09.5404983Z contiguous: bool, 2025-05-07T20:33:09.5405220Z compiled: bool, 2025-05-07T20:33:09.5405428Z ) -> None: 2025-05-07T20:33:09.5405635Z torch.manual_seed(2025) 2025-05-07T20:33:09.5405869Z 2025-05-07T20:33:09.5406128Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5408197Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5410238Z 2025-05-07T20:33:09.5410352Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5410561Z 2025-05-07T20:33:09.5410668Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5411066Z self=, 2025-05-07T20:33:09.5411458Z T=4096, 2025-05-07T20:33:09.5411634Z D=7168, 2025-05-07T20:33:09.5411816Z scale_ub=None, 2025-05-07T20:33:09.5412014Z contiguous=True, 2025-05-07T20:33:09.5412226Z compiled=True, 2025-05-07T20:33:09.5412421Z ) 2025-05-07T20:33:09.5412726Z self = 2025-05-07T20:33:09.5413206Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:09.5413466Z 2025-05-07T20:33:09.5413558Z @given( 2025-05-07T20:33:09.5413792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5414184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5414490Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5414813Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5415239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5415524Z ) 2025-05-07T20:33:09.5415869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5416303Z def test_silu_mul_quant( 2025-05-07T20:33:09.5416550Z self, 2025-05-07T20:33:09.5416749Z T: int, 2025-05-07T20:33:09.5416949Z D: int, 2025-05-07T20:33:09.5417179Z scale_ub: Optional[float], 2025-05-07T20:33:09.5417465Z contiguous: bool, 2025-05-07T20:33:09.5417704Z compiled: bool, 2025-05-07T20:33:09.5417940Z ) -> None: 2025-05-07T20:33:09.5418166Z torch.manual_seed(2025) 2025-05-07T20:33:09.5418407Z 2025-05-07T20:33:09.5418686Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5420725Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5422671Z 2025-05-07T20:33:09.5422793Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5423009Z 2025-05-07T20:33:09.5423133Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5423546Z self=, 2025-05-07T20:33:09.5423965Z T=2048, 2025-05-07T20:33:09.5424158Z D=5120, 2025-05-07T20:33:09.5424350Z scale_ub=1200.0, 2025-05-07T20:33:09.5424580Z contiguous=False, 2025-05-07T20:33:09.5424808Z compiled=False, 2025-05-07T20:33:09.5973192Z ) 2025-05-07T20:33:09.5974277Z self = 2025-05-07T20:33:09.5975365Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.5975894Z 2025-05-07T20:33:09.5975974Z @given( 2025-05-07T20:33:09.5976206Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5976523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5976828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5977164Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5977503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5977788Z ) 2025-05-07T20:33:09.5978298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5978759Z def test_silu_mul_quant( 2025-05-07T20:33:09.5978995Z self, 2025-05-07T20:33:09.5979193Z T: int, 2025-05-07T20:33:09.5979401Z D: int, 2025-05-07T20:33:09.5979619Z scale_ub: Optional[float], 2025-05-07T20:33:09.5979895Z contiguous: bool, 2025-05-07T20:33:09.5980140Z compiled: bool, 2025-05-07T20:33:09.5980365Z ) -> None: 2025-05-07T20:33:09.5980590Z torch.manual_seed(2025) 2025-05-07T20:33:09.5980840Z 2025-05-07T20:33:09.5981114Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.5983327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.5985283Z 2025-05-07T20:33:09.5985402Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.5985682Z 2025-05-07T20:33:09.5985784Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.5986205Z self=, 2025-05-07T20:33:09.5986623Z T=4096, 2025-05-07T20:33:09.5986805Z D=7168, 2025-05-07T20:33:09.5986997Z scale_ub=1200.0, 2025-05-07T20:33:09.5987235Z contiguous=True, 2025-05-07T20:33:09.5987456Z compiled=False, 2025-05-07T20:33:09.5987666Z ) 2025-05-07T20:33:09.5987993Z self = 2025-05-07T20:33:09.5988496Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.5988781Z 2025-05-07T20:33:09.5988857Z @given( 2025-05-07T20:33:09.5989094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.5989403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.5989704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.5990112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.5990429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.5990713Z ) 2025-05-07T20:33:09.5991056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.5991491Z def test_silu_mul_quant( 2025-05-07T20:33:09.5991727Z self, 2025-05-07T20:33:09.5991928Z T: int, 2025-05-07T20:33:09.5992123Z D: int, 2025-05-07T20:33:09.5992332Z scale_ub: Optional[float], 2025-05-07T20:33:09.5992603Z contiguous: bool, 2025-05-07T20:33:09.5992852Z compiled: bool, 2025-05-07T20:33:09.5993066Z ) -> None: 2025-05-07T20:33:09.5993284Z torch.manual_seed(2025) 2025-05-07T20:33:09.5993571Z 2025-05-07T20:33:09.6001297Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6003394Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6005364Z 2025-05-07T20:33:09.6005487Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6005700Z 2025-05-07T20:33:09.6005812Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6006305Z self=, 2025-05-07T20:33:09.6006709Z T=16384, 2025-05-07T20:33:09.6006912Z D=7168, 2025-05-07T20:33:09.6007111Z scale_ub=None, 2025-05-07T20:33:09.6007329Z contiguous=False, 2025-05-07T20:33:09.6007562Z compiled=True, 2025-05-07T20:33:09.6007776Z ) 2025-05-07T20:33:09.6008097Z self = 2025-05-07T20:33:09.6008863Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.6009143Z 2025-05-07T20:33:09.6009230Z @given( 2025-05-07T20:33:09.6009458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6009774Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6010087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6010421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6010750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6011036Z ) 2025-05-07T20:33:09.6011473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6011914Z def test_silu_mul_quant( 2025-05-07T20:33:09.6012165Z self, 2025-05-07T20:33:09.6012373Z T: int, 2025-05-07T20:33:09.6012573Z D: int, 2025-05-07T20:33:09.6012858Z scale_ub: Optional[float], 2025-05-07T20:33:09.6013134Z contiguous: bool, 2025-05-07T20:33:09.6013369Z compiled: bool, 2025-05-07T20:33:09.6013610Z ) -> None: 2025-05-07T20:33:09.6013834Z torch.manual_seed(2025) 2025-05-07T20:33:09.6014074Z 2025-05-07T20:33:09.6014357Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6016403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
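The requested sizes track the example parameters exactly: the test's input is a [T, 2 * D] bfloat16 tensor, i.e. T * 2D * 2 bytes, which gives the 40, 56, 80, 112, 320 and 448 MiB figures seen above (the few smaller requests that still report 20.00 MiB are consistent with the caching allocator rounding sub-10 MiB requests up to its 20 MiB large-block size). A worked check for the failure directly above:

    # Sketch: the OOM request size follows from the test's tensor shape.
    T, D = 16384, 7168
    size_bytes = T * (2 * D) * 2      # [T, 2*D] in bfloat16 (2 bytes/elem)
    print(size_bytes / 2**20)         # 448.0 MiB, matching the error above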
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6018337Z 2025-05-07T20:33:09.6018458Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6018668Z 2025-05-07T20:33:09.6018772Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6019176Z self=, 2025-05-07T20:33:09.6019574Z T=4096, 2025-05-07T20:33:09.6019761Z D=7168, 2025-05-07T20:33:09.6019941Z scale_ub=None, 2025-05-07T20:33:09.6020151Z contiguous=True, 2025-05-07T20:33:09.6020373Z compiled=False, 2025-05-07T20:33:09.6020578Z ) 2025-05-07T20:33:09.6020892Z self = 2025-05-07T20:33:09.6021390Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.6021659Z 2025-05-07T20:33:09.6021742Z @given( 2025-05-07T20:33:09.6021964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6022277Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6022593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6022912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6023243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6023526Z ) 2025-05-07T20:33:09.6023864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6024302Z def test_silu_mul_quant( 2025-05-07T20:33:09.6024551Z self, 2025-05-07T20:33:09.6024742Z T: int, 2025-05-07T20:33:09.6024943Z D: int, 2025-05-07T20:33:09.6025160Z scale_ub: Optional[float], 2025-05-07T20:33:09.6025425Z contiguous: bool, 2025-05-07T20:33:09.6025736Z compiled: bool, 2025-05-07T20:33:09.6025955Z ) -> None: 2025-05-07T20:33:09.6026164Z torch.manual_seed(2025) 2025-05-07T20:33:09.6026403Z 2025-05-07T20:33:09.6026670Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6028719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6030597Z 2025-05-07T20:33:09.6030714Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6030932Z 2025-05-07T20:33:09.6031034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6031487Z self=, 2025-05-07T20:33:09.6031890Z T=16384, 2025-05-07T20:33:09.6032075Z D=7168, 2025-05-07T20:33:09.6032265Z scale_ub=None, 2025-05-07T20:33:09.6032514Z contiguous=True, 2025-05-07T20:33:09.6032729Z compiled=False, 2025-05-07T20:33:09.6032933Z ) 2025-05-07T20:33:09.6033243Z self = 2025-05-07T20:33:09.6033728Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:09.6034008Z 2025-05-07T20:33:09.6034086Z @given( 2025-05-07T20:33:09.6034315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6034627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6034927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6035256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6035582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6035858Z ) 2025-05-07T20:33:09.6036196Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6036628Z def test_silu_mul_quant( 2025-05-07T20:33:09.6036904Z self, 2025-05-07T20:33:09.6037094Z T: int, 2025-05-07T20:33:09.6037289Z D: int, 2025-05-07T20:33:09.6037496Z scale_ub: Optional[float], 2025-05-07T20:33:09.6037757Z contiguous: bool, 2025-05-07T20:33:09.6037993Z compiled: bool, 2025-05-07T20:33:09.6038205Z ) -> None: 2025-05-07T20:33:09.6038411Z torch.manual_seed(2025) 2025-05-07T20:33:09.6038651Z 2025-05-07T20:33:09.6038907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6040940Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6042793Z 2025-05-07T20:33:09.6042910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.6043127Z 2025-05-07T20:33:09.6043223Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.6043632Z self=, 2025-05-07T20:33:09.6044022Z T=16384, 2025-05-07T20:33:09.6044215Z D=7168, 2025-05-07T20:33:09.6044505Z scale_ub=1200.0, 2025-05-07T20:33:09.6044711Z contiguous=True, 2025-05-07T20:33:09.6044925Z compiled=False, 2025-05-07T20:33:09.6045129Z ) 2025-05-07T20:33:09.6045495Z self = 2025-05-07T20:33:09.6045990Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.6046269Z 2025-05-07T20:33:09.6046344Z @given( 2025-05-07T20:33:09.6046571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.6046884Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.6047186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.6047511Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.6047831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.6048113Z ) 2025-05-07T20:33:09.6048457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.6048890Z def test_silu_mul_quant( 2025-05-07T20:33:09.6049124Z self, 2025-05-07T20:33:09.6049320Z T: int, 2025-05-07T20:33:09.6049519Z D: int, 2025-05-07T20:33:09.6049734Z scale_ub: Optional[float], 2025-05-07T20:33:09.6050012Z contiguous: bool, 2025-05-07T20:33:09.6050331Z compiled: bool, 2025-05-07T20:33:09.6050548Z ) -> None: 2025-05-07T20:33:09.6050757Z torch.manual_seed(2025) 2025-05-07T20:33:09.6050994Z 2025-05-07T20:33:09.6051259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.6053390Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
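Every one of these failures reports roughly 21.7 GiB already allocated by PyTorch while only tens of MiB are being requested, so the pressure comes from tensors and cached blocks left over from earlier examples and tests, not from the current example. A minimal cleanup sketch, assuming it is acceptable to flush the caching allocator between examples (for instance from setUp):

    # Sketch (assumption: run before or after each example or test):
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.empty_cache()  # then return cached, unused blocks to the driver

Note that empty_cache() cannot free tensors that are still referenced; if the ~21.7 GiB is held live by the process, the leak has to be found upstream of this test.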
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.6055305Z 2025-05-07T20:33:09.6055424Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.7870671Z 2025-05-07T20:33:09.7871296Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7872660Z self=, 2025-05-07T20:33:09.7873618Z T=128, 2025-05-07T20:33:09.7873796Z D=5120, 2025-05-07T20:33:09.7873987Z scale_ub=1200.0, 2025-05-07T20:33:09.7874196Z contiguous=False, 2025-05-07T20:33:09.7874413Z compiled=False, 2025-05-07T20:33:09.7874614Z ) 2025-05-07T20:33:09.7874916Z self = 2025-05-07T20:33:09.7875399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.7875676Z 2025-05-07T20:33:09.7875748Z @given( 2025-05-07T20:33:09.7875967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7876264Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7876571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7876898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7877214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7877492Z ) 2025-05-07T20:33:09.7877830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7878265Z def test_silu_mul_quant( 2025-05-07T20:33:09.7878494Z self, 2025-05-07T20:33:09.7878708Z T: int, 2025-05-07T20:33:09.7878898Z D: int, 2025-05-07T20:33:09.7879107Z scale_ub: Optional[float], 2025-05-07T20:33:09.7879363Z contiguous: bool, 2025-05-07T20:33:09.7879600Z compiled: bool, 2025-05-07T20:33:09.7879826Z ) -> None: 2025-05-07T20:33:09.7880027Z torch.manual_seed(2025) 2025-05-07T20:33:09.7880264Z 2025-05-07T20:33:09.7880532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7880867Z 2025-05-07T20:33:09.7881045Z x_sign = torch.sign(x) 2025-05-07T20:33:09.7881429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.7881741Z x = x_sign * x_clamp 2025-05-07T20:33:09.7881966Z x0 = x[:, :D] 2025-05-07T20:33:09.7882174Z x1 = x[:, D:] 2025-05-07T20:33:09.7882374Z 2025-05-07T20:33:09.7882549Z if contiguous: 2025-05-07T20:33:09.7882776Z x0 = x0.contiguous() 2025-05-07T20:33:09.7883028Z x1 = x1.contiguous() 2025-05-07T20:33:09.7883259Z 2025-05-07T20:33:09.7883444Z if scale_ub is not None: 2025-05-07T20:33:09.7883707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.7884035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.7884491Z ) 2025-05-07T20:33:09.7884674Z else: 2025-05-07T20:33:09.7884867Z scale_ub_tensor = None 2025-05-07T20:33:09.7885105Z 2025-05-07T20:33:09.7885330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.7885633Z op = silu_mul_quant 2025-05-07T20:33:09.7885873Z if compiled: 2025-05-07T20:33:09.7886199Z op = torch.compile(op) 2025-05-07T20:33:09.7886484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7886740Z 2025-05-07T20:33:09.7886936Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.7887174Z 2025-05-07T20:33:09.7887268Z moe/activation_test.py:117: 2025-05-07T20:33:09.7887556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7887879Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.7888144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7888826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.7889501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.7890031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.7890700Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.7891354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.7891871Z kernel = self.compile( 2025-05-07T20:33:09.7892456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.7893096Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.7893486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7893708Z 2025-05-07T20:33:09.7893918Z self = 2025-05-07T20:33:09.7894991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.7896350Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aefe91c0>} 2025-05-07T20:33:09.7897676Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.7898693Z context = 2025-05-07T20:33:09.7898980Z 2025-05-07T20:33:09.7899152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.7899662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.7900122Z module_map=module_map) 2025-05-07T20:33:09.7900481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.7900878Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.7901134Z E ^ 2025-05-07T20:33:09.7901598Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.7902043Z 2025-05-07T20:33:09.7902458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.7902971Z 2025-05-07T20:33:09.7903070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7903478Z self=, 2025-05-07T20:33:09.7903873Z T=2048, 2025-05-07T20:33:09.7904068Z D=7168, 2025-05-07T20:33:09.7904258Z scale_ub=None, 2025-05-07T20:33:09.7904480Z contiguous=False, 2025-05-07T20:33:09.7904706Z compiled=False, 2025-05-07T20:33:09.7904904Z ) 2025-05-07T20:33:09.7905218Z self = 2025-05-07T20:33:09.7905713Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.7905983Z 2025-05-07T20:33:09.7906102Z @given( 2025-05-07T20:33:09.7906332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7906644Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7906942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7907310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7907635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7907919Z ) 2025-05-07T20:33:09.7908517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7908961Z def test_silu_mul_quant( 2025-05-07T20:33:09.7909205Z self, 2025-05-07T20:33:09.7909391Z T: int, 2025-05-07T20:33:09.7909591Z D: int, 2025-05-07T20:33:09.7909806Z scale_ub: Optional[float], 2025-05-07T20:33:09.7910069Z contiguous: bool, 2025-05-07T20:33:09.7910314Z compiled: bool, 2025-05-07T20:33:09.7910536Z ) -> None: 2025-05-07T20:33:09.7910747Z torch.manual_seed(2025) 2025-05-07T20:33:09.7910989Z 2025-05-07T20:33:09.7911264Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7913304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:09.7915234Z 2025-05-07T20:33:09.7915358Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:09.7915567Z 2025-05-07T20:33:09.7915670Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.7916078Z self=, 2025-05-07T20:33:09.7916477Z T=128, 2025-05-07T20:33:09.7916659Z D=7168, 2025-05-07T20:33:09.7916844Z scale_ub=1200.0, 2025-05-07T20:33:09.7917074Z contiguous=True, 2025-05-07T20:33:09.7917290Z compiled=True, 2025-05-07T20:33:09.7917492Z ) 2025-05-07T20:33:09.7917807Z self = 2025-05-07T20:33:09.7918282Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:09.7918552Z 2025-05-07T20:33:09.7918625Z @given( 2025-05-07T20:33:09.7918853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.7919159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.7919454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.7919778Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.7920171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.7920448Z ) 2025-05-07T20:33:09.7920794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.7921229Z def test_silu_mul_quant( 2025-05-07T20:33:09.7921463Z self, 2025-05-07T20:33:09.7921662Z T: int, 2025-05-07T20:33:09.7921868Z D: int, 2025-05-07T20:33:09.7922087Z scale_ub: Optional[float], 2025-05-07T20:33:09.7922353Z contiguous: bool, 2025-05-07T20:33:09.7922593Z compiled: bool, 2025-05-07T20:33:09.7922813Z ) -> None: 2025-05-07T20:33:09.7923020Z torch.manual_seed(2025) 2025-05-07T20:33:09.7923258Z 2025-05-07T20:33:09.7923533Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.7923912Z 2025-05-07T20:33:09.7924106Z x_sign = torch.sign(x) 2025-05-07T20:33:09.7924472Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.7924782Z x = x_sign * x_clamp 2025-05-07T20:33:09.7925020Z x0 = x[:, :D] 2025-05-07T20:33:09.7925307Z x1 = x[:, D:] 2025-05-07T20:33:09.7925509Z 2025-05-07T20:33:09.7925697Z if contiguous: 2025-05-07T20:33:09.7925926Z x0 = x0.contiguous() 2025-05-07T20:33:09.7926182Z x1 = x1.contiguous() 2025-05-07T20:33:09.7926479Z 2025-05-07T20:33:09.7926671Z if scale_ub is not None: 2025-05-07T20:33:09.7926934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.7927264Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.7927570Z ) 2025-05-07T20:33:09.7927763Z else: 2025-05-07T20:33:09.7927968Z scale_ub_tensor = None 2025-05-07T20:33:09.7928218Z 2025-05-07T20:33:09.7928447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.7928752Z op = silu_mul_quant 2025-05-07T20:33:09.7929005Z if compiled: 2025-05-07T20:33:09.7929255Z op = torch.compile(op) 2025-05-07T20:33:09.7929543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7929822Z 2025-05-07T20:33:09.7930022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.7930188Z 2025-05-07T20:33:09.7930285Z moe/activation_test.py:117: 2025-05-07T20:33:09.7930624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7930953Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.7931231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.7931781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.7932335Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.7932987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.7933664Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.7934195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.7934872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.7935532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.7936059Z kernel = self.compile( 2025-05-07T20:33:09.7936593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.7937250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.7937641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.7937871Z 2025-05-07T20:33:09.7938074Z self = 2025-05-07T20:33:09.7939231Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.7940601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aec6bb00>} 2025-05-07T20:33:09.7941934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.7942946Z context = 2025-05-07T20:33:09.7943239Z 2025-05-07T20:33:09.7943403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.7943922Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.7944391Z module_map=module_map) 2025-05-07T20:33:09.7944750Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.7945100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.7945399Z E ^ 2025-05-07T20:33:09.7945857Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.7946313Z 2025-05-07T20:33:09.7946767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.0875325Z 2025-05-07T20:33:10.0876124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0877970Z self=, 2025-05-07T20:33:10.0879389Z T=128, 2025-05-07T20:33:10.0879774Z D=7168, 2025-05-07T20:33:10.0880163Z scale_ub=1200.0, 2025-05-07T20:33:10.0880609Z contiguous=True, 2025-05-07T20:33:10.0881074Z compiled=False, 2025-05-07T20:33:10.0881504Z ) 2025-05-07T20:33:10.0882174Z self = 2025-05-07T20:33:10.0882842Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.0883132Z 2025-05-07T20:33:10.0883229Z @given( 2025-05-07T20:33:10.0883459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0884107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0884519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0884841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0885170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0885451Z ) 2025-05-07T20:33:10.0885801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0886243Z def test_silu_mul_quant( 2025-05-07T20:33:10.0886483Z self, 2025-05-07T20:33:10.0886676Z T: int, 2025-05-07T20:33:10.0886866Z D: int, 2025-05-07T20:33:10.0887083Z scale_ub: Optional[float], 2025-05-07T20:33:10.0887355Z contiguous: bool, 2025-05-07T20:33:10.0887590Z compiled: bool, 2025-05-07T20:33:10.0887834Z ) -> None: 2025-05-07T20:33:10.0888063Z torch.manual_seed(2025) 2025-05-07T20:33:10.0888301Z 2025-05-07T20:33:10.0888591Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0888951Z 2025-05-07T20:33:10.0889141Z x_sign = torch.sign(x) 2025-05-07T20:33:10.0889445Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.0891614Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0893546Z 2025-05-07T20:33:10.0893670Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.0893883Z 2025-05-07T20:33:10.0894001Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0894422Z self=, 2025-05-07T20:33:10.0894852Z T=128, 2025-05-07T20:33:10.0895039Z D=5120, 2025-05-07T20:33:10.0895233Z scale_ub=1200.0, 2025-05-07T20:33:10.0895451Z contiguous=True, 2025-05-07T20:33:10.0895665Z compiled=True, 2025-05-07T20:33:10.0896004Z ) 2025-05-07T20:33:10.0896586Z self = 2025-05-07T20:33:10.0905005Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.0905402Z 2025-05-07T20:33:10.0905537Z @given( 2025-05-07T20:33:10.0905850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0906310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0906907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0907357Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0907819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0908557Z ) 2025-05-07T20:33:10.0909177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0909800Z def test_silu_mul_quant( 2025-05-07T20:33:10.0910130Z self, 2025-05-07T20:33:10.0910384Z T: int, 2025-05-07T20:33:10.0910632Z D: int, 2025-05-07T20:33:10.0910932Z scale_ub: Optional[float], 2025-05-07T20:33:10.0911303Z contiguous: bool, 2025-05-07T20:33:10.0911618Z compiled: bool, 2025-05-07T20:33:10.0911921Z ) -> None: 2025-05-07T20:33:10.0912210Z torch.manual_seed(2025) 2025-05-07T20:33:10.0912526Z 2025-05-07T20:33:10.0912901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0913367Z 2025-05-07T20:33:10.0913618Z x_sign = torch.sign(x) 2025-05-07T20:33:10.0914013Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.0916826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0919549Z 2025-05-07T20:33:10.0919707Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.0919997Z 2025-05-07T20:33:10.0920147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.0920704Z self=, 2025-05-07T20:33:10.0921263Z T=128, 2025-05-07T20:33:10.0921514Z D=7168, 2025-05-07T20:33:10.0921763Z scale_ub=None, 2025-05-07T20:33:10.0922053Z contiguous=True, 2025-05-07T20:33:10.0922365Z compiled=True, 2025-05-07T20:33:10.0922637Z ) 2025-05-07T20:33:10.0923079Z self = 2025-05-07T20:33:10.0923734Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.0924105Z 2025-05-07T20:33:10.0924212Z @given( 2025-05-07T20:33:10.0924627Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0925046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.0925459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.0925892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.0926326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.0926818Z ) 2025-05-07T20:33:10.0927296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.0927866Z def test_silu_mul_quant( 2025-05-07T20:33:10.0928200Z self, 2025-05-07T20:33:10.0928482Z T: int, 2025-05-07T20:33:10.0928748Z D: int, 2025-05-07T20:33:10.0929066Z scale_ub: Optional[float], 2025-05-07T20:33:10.0929448Z contiguous: bool, 2025-05-07T20:33:10.0929771Z compiled: bool, 2025-05-07T20:33:10.0930087Z ) -> None: 2025-05-07T20:33:10.0930396Z torch.manual_seed(2025) 2025-05-07T20:33:10.0930718Z 2025-05-07T20:33:10.0931098Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0933898Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0936595Z 2025-05-07T20:33:10.0936770Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.0937059Z 2025-05-07T20:33:10.0938080Z FAILED 2025-05-07T20:33:10.0938216Z 2025-05-07T20:33:10.0938392Z =================================== FAILURES =================================== 2025-05-07T20:33:10.0938963Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:10.0939577Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:10.0940424Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:10.0941155Z | yield 2025-05-07T20:33:10.0941765Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:10.0942481Z | self._callTestMethod(testMethod) 2025-05-07T20:33:10.0942893Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:10.0943516Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:10.0944077Z | if method() is not None: 2025-05-07T20:33:10.0944327Z | ~~~~~~^^ 2025-05-07T20:33:10.0944940Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:10.0945650Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.0945956Z | ^^^^^^^ 2025-05-07T20:33:10.0946527Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:10.0947146Z | raise the_error_hypothesis_found 2025-05-07T20:33:10.0947578Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:10.0948008Z +-+---------------- 1 ---------------- 2025-05-07T20:33:10.0948297Z | Traceback (most recent call last): 2025-05-07T20:33:10.0949014Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0949788Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0951865Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0953821Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0954254Z | self=, 2025-05-07T20:33:10.0954677Z | T=2048, 2025-05-07T20:33:10.0954930Z | D=5120, # or any other generated value 2025-05-07T20:33:10.0955267Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:10.0955635Z | contiguous=True, # or any other generated value 2025-05-07T20:33:10.0956008Z | compiled=False, # or any other generated value 2025-05-07T20:33:10.0956325Z | ) 2025-05-07T20:33:10.0956511Z | 2025-05-07T20:33:10.0957044Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:10.0957665Z +---------------- 2 ---------------- 2025-05-07T20:33:10.0957961Z | Traceback (most recent call last): 2025-05-07T20:33:10.0958735Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0959518Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0961590Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0963988Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0964712Z | self=, 2025-05-07T20:33:10.0965279Z | T=128, 2025-05-07T20:33:10.0965568Z | D=7168, 2025-05-07T20:33:10.0965847Z | scale_ub=None, 2025-05-07T20:33:10.0966251Z | contiguous=True, 2025-05-07T20:33:10.0966585Z | compiled=True, 2025-05-07T20:33:10.0966887Z | ) 2025-05-07T20:33:10.0967152Z | 2025-05-07T20:33:10.0967896Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.0968730Z +---------------- 3 ---------------- 2025-05-07T20:33:10.0969115Z | Traceback (most recent call last): 2025-05-07T20:33:10.0970063Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:10.0971109Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.0973869Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
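The reproduce instructions above come from Hypothesis itself: @reproduce_failure replays one stored choice sequence, and the blob is only valid for the exact Hypothesis version shown ('6.131.14'). A sketch replaying the first falsifying example, with the @given strategies copied from the test as it appears in this log:

    # Sketch: temporarily pin the first falsifying example from the log.
    # Remove the decorator again once the failure is understood or fixed.
    from hypothesis import given, reproduce_failure
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body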
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.0976536Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.0977123Z | self=, 2025-05-07T20:33:10.0977650Z | T=128, 2025-05-07T20:33:10.0977896Z | D=5120, 2025-05-07T20:33:10.0978171Z | scale_ub=1200.0, 2025-05-07T20:33:10.0978553Z | contiguous=True, 2025-05-07T20:33:10.0978883Z | compiled=True, 2025-05-07T20:33:10.0979185Z | ) 2025-05-07T20:33:10.0979431Z | 2025-05-07T20:33:10.0980175Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.0981011Z +---------------- 4 ---------------- 2025-05-07T20:33:10.0981408Z | Traceback (most recent call last): 2025-05-07T20:33:10.0982393Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:10.0983355Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.0983744Z | ~~~~~~^^ 2025-05-07T20:33:10.0984617Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:10.0985587Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.0986835Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:10.0987951Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.0988431Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:10.0988796Z | a, 2025-05-07T20:33:10.0989094Z | ^^ 2025-05-07T20:33:10.0989401Z | ...<23 lines>... 
2025-05-07T20:33:10.0989750Z | USE_INT64=use_int64, 2025-05-07T20:33:10.0990142Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0990506Z | ) 2025-05-07T20:33:10.0990779Z | ^ 2025-05-07T20:33:10.0991530Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:10.0992559Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.0993252Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0994114Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:10.0995237Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.0995882Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0996741Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:10.0997684Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.0998195Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:10.0998979Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:10.0999712Z | fn() 2025-05-07T20:33:10.0999965Z | ~~^^ 2025-05-07T20:33:10.1000718Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:10.1001538Z | self.fn.run( 2025-05-07T20:33:10.1001813Z | ~~~~~~~~~~~^ 2025-05-07T20:33:10.1002085Z | *args, 2025-05-07T20:33:10.1002358Z | ^^^^^^ 2025-05-07T20:33:10.1002652Z | **current, 2025-05-07T20:33:10.1002966Z | ^^^^^^^^^^ 2025-05-07T20:33:10.1003253Z | ) 2025-05-07T20:33:10.1003495Z | ^ 2025-05-07T20:33:10.1004143Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:10.1005017Z | kernel = self.compile( 2025-05-07T20:33:10.1005369Z | src, 2025-05-07T20:33:10.1005640Z | target=target, 2025-05-07T20:33:10.1006044Z | options=options.__dict__, 2025-05-07T20:33:10.1006409Z | ) 2025-05-07T20:33:10.1007135Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:10.1008114Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1009322Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:10.1010399Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1011024Z | module_map=module_map) 2025-05-07T20:33:10.1011513Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1011988Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.1012328Z | ^ 2025-05-07T20:33:10.1013006Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1013927Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:10.1014464Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:10.1015175Z | self=, 2025-05-07T20:33:10.1015841Z | T=1, # or any other generated value 2025-05-07T20:33:10.1016245Z | D=5120, # or any other generated value 2025-05-07T20:33:10.1016662Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:10.1017119Z | contiguous=True, # or any other generated value 2025-05-07T20:33:10.1017574Z | compiled=True, # or any other generated value 2025-05-07T20:33:10.1017941Z | ) 2025-05-07T20:33:10.1018171Z | 2025-05-07T20:33:10.1018855Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:10.1019634Z +------------------------------------ 2025-05-07T20:33:10.1020116Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:10.1020616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1021145Z self=, 2025-05-07T20:33:10.1021765Z T=1, 2025-05-07T20:33:10.1022029Z D=5120, 2025-05-07T20:33:10.1022299Z scale_ub=None, 2025-05-07T20:33:10.1022586Z contiguous=True, 2025-05-07T20:33:10.1022903Z compiled=True, 2025-05-07T20:33:10.1023177Z ) 2025-05-07T20:33:10.1023569Z self = 2025-05-07T20:33:10.1024197Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.1024525Z 2025-05-07T20:33:10.1024649Z @given( 2025-05-07T20:33:10.1024947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1025390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1025835Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1026272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1026702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1027096Z ) 2025-05-07T20:33:10.1027583Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1028194Z def test_silu_mul_quant( 2025-05-07T20:33:10.1028519Z self, 2025-05-07T20:33:10.1028762Z T: int, 2025-05-07T20:33:10.1029001Z D: int, 2025-05-07T20:33:10.1029261Z scale_ub: Optional[float], 2025-05-07T20:33:10.1029595Z contiguous: bool, 2025-05-07T20:33:10.1029865Z compiled: bool, 2025-05-07T20:33:10.1030125Z ) -> None: 2025-05-07T20:33:10.1030380Z torch.manual_seed(2025) 2025-05-07T20:33:10.1030673Z 2025-05-07T20:33:10.1031026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1031562Z 2025-05-07T20:33:10.1031811Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1032184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1032598Z x = x_sign * x_clamp 2025-05-07T20:33:10.1032938Z x0 = x[:, :D] 2025-05-07T20:33:10.1033222Z x1 = x[:, D:] 2025-05-07T20:33:10.1033519Z 2025-05-07T20:33:10.1033773Z if contiguous: 2025-05-07T20:33:10.1034064Z x0 = x0.contiguous() 2025-05-07T20:33:10.1034404Z x1 = x1.contiguous() 2025-05-07T20:33:10.1034743Z 2025-05-07T20:33:10.1035005Z if scale_ub is not None: 2025-05-07T20:33:10.1035380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1035826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1036221Z ) 2025-05-07T20:33:10.1036476Z else: 2025-05-07T20:33:10.1036751Z scale_ub_tensor = None 2025-05-07T20:33:10.1037068Z 2025-05-07T20:33:10.1037381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1037789Z op = silu_mul_quant 2025-05-07T20:33:10.1038181Z if compiled: 2025-05-07T20:33:10.1040035Z op = torch.compile(op) 2025-05-07T20:33:10.1040417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1040790Z 2025-05-07T20:33:10.1041081Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:10.1041456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:10.1041839Z 2025-05-07T20:33:10.1042139Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1042602Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:10.1043034Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:10.1043429Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:10.1043896Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.1044416Z 2025-05-07T20:33:10.1044666Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.1044935Z 2025-05-07T20:33:10.1045058Z moe/activation_test.py:126: 2025-05-07T20:33:10.1045453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1045893Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:10.1046367Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.1047397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:10.1048384Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.1049104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1049970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1050862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:10.1051789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.1052713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:10.1053534Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.1054335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:10.1055048Z fn() 2025-05-07T20:33:10.1055751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:10.1056575Z self.fn.run( 2025-05-07T20:33:10.1057228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1057966Z kernel = self.compile( 2025-05-07T20:33:10.1058763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1059664Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1060168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1060459Z 2025-05-07T20:33:10.1060710Z self = 2025-05-07T20:33:10.1062103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1063924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f16d32be700>} 2025-05-07T20:33:10.1065668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1067066Z context = 2025-05-07T20:33:10.1067442Z 2025-05-07T20:33:10.1067662Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1068367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1069051Z module_map=module_map) 2025-05-07T20:33:10.1069537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1070040Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.1070387Z E ^ 2025-05-07T20:33:10.1070974Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1071558Z 2025-05-07T20:33:10.1072095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1072826Z 2025-05-07T20:33:10.1072954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1073482Z self=, 2025-05-07T20:33:10.1073994Z T=2048, 2025-05-07T20:33:10.1074249Z D=5120, 2025-05-07T20:33:10.1074570Z scale_ub=1200.0, 2025-05-07T20:33:10.1074862Z contiguous=True, 2025-05-07T20:33:10.1075137Z compiled=False, 2025-05-07T20:33:10.1075419Z ) 2025-05-07T20:33:10.1075846Z self = 2025-05-07T20:33:10.1076499Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.1076879Z 2025-05-07T20:33:10.1076982Z @given( 2025-05-07T20:33:10.1077300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1077707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1078113Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1078563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1078984Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1079366Z ) 2025-05-07T20:33:10.1079833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1080450Z def test_silu_mul_quant( 2025-05-07T20:33:10.1080759Z self, 2025-05-07T20:33:10.1081017Z T: int, 2025-05-07T20:33:10.1081266Z D: int, 2025-05-07T20:33:10.1081552Z scale_ub: Optional[float], 2025-05-07T20:33:10.1081915Z contiguous: bool, 2025-05-07T20:33:10.1082232Z compiled: bool, 2025-05-07T20:33:10.1082515Z ) -> None: 2025-05-07T20:33:10.1082783Z torch.manual_seed(2025) 2025-05-07T20:33:10.1083111Z 2025-05-07T20:33:10.1083456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1083910Z 2025-05-07T20:33:10.1084149Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1084611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1085088Z x = x_sign * x_clamp 2025-05-07T20:33:10.1085396Z x0 = x[:, :D] 2025-05-07T20:33:10.1085655Z x1 = x[:, D:] 2025-05-07T20:33:10.1109641Z 2025-05-07T20:33:10.1109914Z if contiguous: 2025-05-07T20:33:10.1110219Z x0 = x0.contiguous() 2025-05-07T20:33:10.1110559Z x1 = x1.contiguous() 2025-05-07T20:33:10.1110844Z 2025-05-07T20:33:10.1111088Z if scale_ub is not None: 2025-05-07T20:33:10.1111445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1111886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1112300Z ) 2025-05-07T20:33:10.1112571Z else: 2025-05-07T20:33:10.1112860Z scale_ub_tensor = None 2025-05-07T20:33:10.1113192Z 2025-05-07T20:33:10.1113518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1113950Z op = silu_mul_quant 2025-05-07T20:33:10.1114284Z if compiled: 
2025-05-07T20:33:10.1114633Z op = torch.compile(op) 2025-05-07T20:33:10.1115248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1115615Z 2025-05-07T20:33:10.1115891Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1116111Z 2025-05-07T20:33:10.1116272Z moe/activation_test.py:117: 2025-05-07T20:33:10.1116729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1117167Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1117536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1118473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1119409Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1120145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1121108Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1122013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1122704Z kernel = self.compile( 2025-05-07T20:33:10.1123422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1124626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1125210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1125533Z 2025-05-07T20:33:10.1125824Z self = 2025-05-07T20:33:10.1127329Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1129227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f16d316e020>} 2025-05-07T20:33:10.1130999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1132378Z context = 2025-05-07T20:33:10.1132791Z 2025-05-07T20:33:10.1133005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1133710Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1134309Z module_map=module_map) 2025-05-07T20:33:10.1134771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1135270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1135605Z E ^ 2025-05-07T20:33:10.1136310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1136915Z 2025-05-07T20:33:10.1137464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.1138301Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): identical test source; ref_fn() fails at moe/activation_test.py:126 compiling _kernel_quantize_fp8_row (fp8_gemm.py:2370) with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1179619Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical test source; fn() fails at moe/activation_test.py:117 compiling _fbgemm_silu_mul_quant (activation.py:80) with the same CompilationError
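Every example so far fails for one root cause: this GPU's Triton backend cannot emit fp8e4nv (the e4m3 variant, torch.float8_e4m3fn), only fp8e5 and fp8e4b15. Hardware e4m3 support generally begins at compute capability 8.9 (Ada/Hopper), so pre-8.9 parts such as an A100 (8.0) or A10G (8.6) raise exactly this ValueError. Below is a minimal sketch of a capability guard that would skip these tests rather than error out; the helper and the pytestmark wiring are assumptions, not the actual activation_test.py setup:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn and needs
        # SM 8.9+; older GPUs only get fp8e5 / fp8e4b15, matching the
        # ValueError reported above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical module-level skip for the whole FP8 test file:
    pytestmark = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="fp8e4nv (e4m3) requires compute capability >= 8.9",
    )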
2025-05-07T20:33:10.1211618Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): identical test source; ref_fn() fails compiling _kernel_quantize_fp8_row with the same CompilationError
2025-05-07T20:33:10.1249829Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
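A pattern in these examples: compiled=True runs fail in ref_fn()'s _kernel_quantize_fp8_row, while compiled=False runs fail earlier in fn()'s _fbgemm_silu_mul_quant; either way the fp8 dtype is the problem. Since the error text says fp8e5 (e5m2) does compile here, one possible mitigation is selecting the fp8 format by capability. This is a sketch of how such a fallback could look, not triton_quantize_fp8_row's actual interface:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn lowers to Triton's fp8e4nv; torch.float8_e5m2
        # lowers to fp8e5, which the error above lists as supported.
        major, minor = torch.cuda.get_device_capability()
        return torch.float8_e4m3fn if (major, minor) >= (8, 9) else torch.float8_e5m2

The cost is precision, not range: e5m2 trades a mantissa bit (3 down to 2) for an extra exponent bit, so per-row quantization error grows and test tolerances would need loosening.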
2025-05-07T20:33:10.1278299Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
2025-05-07T20:33:10.1290793Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): identical test source; ref_fn() fails compiling _kernel_quantize_fp8_row with the same CompilationError
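For local debugging, Hypothesis printed the replay recipe alongside each falsifying example above. A sketch of where the decorator goes; the version string and payload are copied verbatim from this log, and the body stands for the unchanged test method:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # copied from the output above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # unchanged body from moe/activation_test.py

The blob only decodes against these exact strategies, and per the Hypothesis note above the decorator is meant to be temporary, removed once the underlying fp8 issue is fixed.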
2025-05-07T20:33:10.1306292Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with the same CompilationError
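The reference path that keeps failing (ref_fn() -> triton_quantize_fp8_row) is mathematically just SiLU(x0) * x1 followed by rowwise fp8 quantization. On hardware without fp8e4nv, a plain-PyTorch stand-in using e5m2 can at least sanity-check the math; this sketch uses our own names and a generic rowwise scheme, not FBGEMM's exact kernel semantics:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e5m2).max  # 57344.0
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # clamp rowwise max, as in the test
        scale = row_max.clamp(min=1e-12) / fp8_max  # per-row dequant scale
        return (y / scale[:, None]).to(torch.float8_e5m2), scale

As in the test body above, dequantization is y_fp8.to(torch.float32) * y_scale[:, None].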
2025-05-07T20:33:10.1318904Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): identical test source; fn() fails compiling _fbgemm_silu_mul_quant with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1331172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.1331281Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1337657Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1346308Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
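The ValueError above is Triton rejecting the fp8e4nv element type at kernel-compile time. fp8e4nv corresponds to torch.float8_e4m3fn, which Triton only lowers natively on compute capability 8.9+ (Ada/Hopper); on an older part such as the A10G (SM 8.6), only fp8e4b15 and fp8e5 are available, which is exactly what the message lists. A minimal probe, as a sketch (supports_fp8e4nv is a hypothetical helper, not part of the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv maps to torch.float8_e4m3fn and needs
        # compute capability >= (8, 9); below that, only fp8e4b15/fp8e5 exist,
        # matching the ValueError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)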
2025-05-07T20:33:10.1346832Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1353232Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1361794Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
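The reference path that fails here computes the SiLU product in fp32 and then calls triton_quantize_fp8_row, which performs row-wise fp8 quantization. A pure-PyTorch sketch of that computation (rowwise_fp8_quant_sketch is hypothetical and stands in for FBGEMM's Triton kernel, under the assumption that y_scale is the per-row dequantization scale, since the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]):

    from typing import Optional, Tuple

    import torch

    def rowwise_fp8_quant_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row by its absmax so values land in float8_e4m3fn range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional clamp mirroring the test's scale_ub tensor argument.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale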
2025-05-07T20:33:10.1362320Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1368730Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1377225Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1377749Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1384049Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1392584Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1393134Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:10.1410994Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1419651Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
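Hypothesis keeps drawing and shrinking examples, and every one fails for the same hardware reason, so the run churns through all _MAX_SAMPLES draws before reporting. One way to avoid that is to gate the test on device capability, as a sketch (the skipif decoration is hypothetical; the test file shown above does not carry it):

    import pytest
    import torch

    fp8_unsupported = not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9)

    @pytest.mark.skipif(fp8_unsupported, reason="Triton fp8e4nv needs SM 8.9+")
    def test_silu_mul_quant_guarded() -> None:
        # Placeholder body; the real property-based test stays unchanged.
        ...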
2025-05-07T20:33:10.1420177Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:10.1425647Z moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn() -> torch/_dynamo/eval_frame.py:678: _fn -> activation.py:80: silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:33:10.1432470Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
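The compiled=True failure above takes the same path, only routed through torch/_dynamo/eval_frame.py: torch.compile does not relax the Triton kernel's architecture requirement, and the same CompilationError surfaces. A sketch of handling both paths uniformly (call_fp8_op is a hypothetical wrapper; the exception type is taken from the traceback above):

    import torch
    from triton.compiler.errors import CompilationError

    def call_fp8_op(op, *args):
        # Eager and torch.compile'd calls both end in Triton's JIT compile,
        # so both raise CompilationError when fp8e4nv is unsupported.
        try:
            return op(*args)
        except CompilationError as err:
            raise RuntimeError(f"fp8 kernel unsupported on this GPU: {err}") from err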
2025-05-07T20:33:10.1433036Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:10.1439317Z moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn() -> moe/activation_test.py:124: triton_quantize_fp8_row -> fp8_gemm.py:2370: _kernel_quantize_fp8_row[grid]
2025-05-07T20:33:10.1447965Z E triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.1448529Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:10.1454206Z moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn() -> moe/activation_test.py:115: in fn -> activation.py:80: silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
2025-05-07T20:33:10.1460089Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:10.1460188Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:10.1460273Z E ^
2025-05-07T20:33:10.1460622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1460632Z 2025-05-07T20:33:10.1461049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1461054Z 2025-05-07T20:33:10.1461153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1461373Z self=, 2025-05-07T20:33:10.1461457Z T=128, 2025-05-07T20:33:10.1461533Z D=5120, 2025-05-07T20:33:10.1461615Z scale_ub=None, 2025-05-07T20:33:10.1461710Z contiguous=False, 2025-05-07T20:33:10.1461790Z compiled=True, 2025-05-07T20:33:10.1461863Z ) 2025-05-07T20:33:10.1462131Z self = 2025-05-07T20:33:10.1462302Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1462307Z 2025-05-07T20:33:10.1462390Z @given( 2025-05-07T20:33:10.1462507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1462608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1462731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1462844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1462955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1463031Z ) 2025-05-07T20:33:10.1463285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1463378Z def test_silu_mul_quant( 2025-05-07T20:33:10.1463460Z self, 2025-05-07T20:33:10.1463536Z T: int, 2025-05-07T20:33:10.1463613Z D: int, 2025-05-07T20:33:10.1463717Z scale_ub: Optional[float], 2025-05-07T20:33:10.1463801Z contiguous: bool, 2025-05-07T20:33:10.1463881Z compiled: bool, 2025-05-07T20:33:10.1464001Z ) -> None: 2025-05-07T20:33:10.1464091Z torch.manual_seed(2025) 2025-05-07T20:33:10.1464156Z 2025-05-07T20:33:10.1464329Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1464432Z 2025-05-07T20:33:10.1464525Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1464645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1464727Z x = x_sign * x_clamp 2025-05-07T20:33:10.1464807Z x0 = x[:, :D] 2025-05-07T20:33:10.1464880Z x1 = x[:, D:] 2025-05-07T20:33:10.1464944Z 2025-05-07T20:33:10.1465029Z if contiguous: 2025-05-07T20:33:10.1465117Z x0 = x0.contiguous() 2025-05-07T20:33:10.1465199Z x1 = x1.contiguous() 2025-05-07T20:33:10.1465273Z 2025-05-07T20:33:10.1465357Z if scale_ub is not None: 2025-05-07T20:33:10.1465463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1465600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1465674Z ) 2025-05-07T20:33:10.1465759Z else: 2025-05-07T20:33:10.1465845Z scale_ub_tensor = None 2025-05-07T20:33:10.1465954Z 2025-05-07T20:33:10.1466086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1466171Z op = silu_mul_quant 2025-05-07T20:33:10.1466249Z if compiled: 2025-05-07T20:33:10.1466349Z op = torch.compile(op) 2025-05-07T20:33:10.1466449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1466515Z 2025-05-07T20:33:10.1466605Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1466609Z 2025-05-07T20:33:10.1466699Z moe/activation_test.py:117: 2025-05-07T20:33:10.1466825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1466927Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1467019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1467391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1467478Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1467966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1468073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1468424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1468650Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1468983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1469068Z kernel = self.compile( 2025-05-07T20:33:10.1469494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1469667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1469787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1469791Z 2025-05-07T20:33:10.1469998Z self = 2025-05-07T20:33:10.1470769Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1471269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afd3ba60>} 2025-05-07T20:33:10.1472008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1472242Z context = 2025-05-07T20:33:10.1472247Z 2025-05-07T20:33:10.1472405Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1472664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1472810Z module_map=module_map) 2025-05-07T20:33:10.1472966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1473056Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1473130Z E ^ 2025-05-07T20:33:10.1473479Z E ValueError("type fp8e4nv not supported in this architecture. 
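The repeated failure is architecture-dependent rather than input-dependent: Triton only lowers fp8e4nv (the e4m3 variant backing torch.float8_e4m3fn) on NVIDIA GPUs with a sufficiently new compute capability, and on older parts it raises exactly this ValueError during make_ir. A guard in the spirit of the test above could skip the fp8 cases on unsupported hardware. This is a sketch, not FBGEMM's code; the helper name supports_fp8e4nv and the (8, 9) threshold are assumptions inferred from the error message:

    # Sketch (assumption): skip fp8e4nv tests on GPUs that Triton rejects.
    # `supports_fp8e4nv` is a hypothetical helper, not part of fbgemm_gpu.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumed threshold: fp8e4nv needs NVIDIA compute capability >= (8, 9)
        # (Ada/Hopper); earlier GPUs hit the ValueError seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class SiluMulQuantTest(unittest.TestCase):
        ...

With such a guard the run would report skips on this hardware instead of a wall of identical CompilationErrors.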
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples failed at the same line, `y_fp8, y_scale = fn()` (moe/activation_test.py:117), with the identical CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example the kernel call `y_fp8, y_scale = fn()` succeeded, and the failure moved to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
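The reference path makes the intended semantics easy to state in eager PyTorch. Under the assumption that triton_quantize_fp8_row performs per-row absmax quantization into torch.float8_e4m3fn (max representable value 448.0), with scale_ub acting as an optional cap on the row maximum, an eager equivalent looks like the following; silu_mul_quant_ref is my name for the sketch, not an FBGEMM API:

    # Sketch (assumption): eager restatement of silu(x0) * x1 followed by
    # row-wise fp8 quantization; dequant is y_fp8.float() * scale[:, None].
    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # assumed max of torch.float8_e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in float32, matching ref_fn in the test above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequant scale; clamp avoids division by zero on all-zero rows.
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This mirrors the dequantization used by the test (`y_fp8.to(torch.float32) * y_scale[:, None]`), so the returned scale is the per-row dequant scale.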
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1557149Z 2025-05-07T20:33:10.1557560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1557564Z 2025-05-07T20:33:10.1557665Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1557895Z self=, 2025-05-07T20:33:10.1557968Z T=1, 2025-05-07T20:33:10.1558041Z D=5120, 2025-05-07T20:33:10.1558127Z scale_ub=1200.0, 2025-05-07T20:33:10.1558211Z contiguous=False, 2025-05-07T20:33:10.1558334Z compiled=True, 2025-05-07T20:33:10.1558414Z ) 2025-05-07T20:33:10.1558629Z self = 2025-05-07T20:33:10.1558792Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1558802Z 2025-05-07T20:33:10.1558875Z @given( 2025-05-07T20:33:10.1558991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1559091Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1559204Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1559316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1559433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1559504Z ) 2025-05-07T20:33:10.1559750Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1559842Z def test_silu_mul_quant( 2025-05-07T20:33:10.1559916Z self, 2025-05-07T20:33:10.1559994Z T: int, 2025-05-07T20:33:10.1560078Z D: int, 2025-05-07T20:33:10.1560174Z scale_ub: Optional[float], 2025-05-07T20:33:10.1560265Z contiguous: bool, 2025-05-07T20:33:10.1560347Z compiled: bool, 2025-05-07T20:33:10.1560427Z ) -> None: 2025-05-07T20:33:10.1560525Z torch.manual_seed(2025) 2025-05-07T20:33:10.1560593Z 2025-05-07T20:33:10.1560757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1560839Z 2025-05-07T20:33:10.1560927Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1561045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1561135Z x = x_sign * x_clamp 2025-05-07T20:33:10.1561258Z x0 = x[:, :D] 2025-05-07T20:33:10.1561338Z x1 = x[:, D:] 2025-05-07T20:33:10.1561415Z 2025-05-07T20:33:10.1561495Z if contiguous: 2025-05-07T20:33:10.1561590Z x0 = x0.contiguous() 2025-05-07T20:33:10.1561677Z x1 = x1.contiguous() 2025-05-07T20:33:10.1561750Z 2025-05-07T20:33:10.1561851Z if scale_ub is not None: 2025-05-07T20:33:10.1561953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1562088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1562172Z ) 2025-05-07T20:33:10.1562249Z else: 2025-05-07T20:33:10.1562339Z scale_ub_tensor = None 2025-05-07T20:33:10.1562419Z 2025-05-07T20:33:10.1562547Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1562637Z op = silu_mul_quant 2025-05-07T20:33:10.1562728Z if compiled: 2025-05-07T20:33:10.1562827Z op = torch.compile(op) 2025-05-07T20:33:10.1562945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1563015Z 2025-05-07T20:33:10.1563151Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1563155Z 2025-05-07T20:33:10.1563258Z moe/activation_test.py:117: 2025-05-07T20:33:10.1563387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1563535Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1563641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1564002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1564092Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1564673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1564769Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1565134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1565355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1565687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1565829Z kernel = self.compile( 2025-05-07T20:33:10.1566207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1566388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1566513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1566518Z 2025-05-07T20:33:10.1566720Z self = 2025-05-07T20:33:10.1567505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1568009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd5300>} 2025-05-07T20:33:10.1568759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1568951Z context = 2025-05-07T20:33:10.1568956Z 2025-05-07T20:33:10.1569115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1569383Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1569486Z module_map=module_map) 2025-05-07T20:33:10.1569697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1569793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1569874Z E ^ 2025-05-07T20:33:10.1570233Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1570240Z 2025-05-07T20:33:10.1570651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1570656Z 2025-05-07T20:33:10.1570760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1570980Z self=, 2025-05-07T20:33:10.1571053Z T=1, 2025-05-07T20:33:10.1571133Z D=5120, 2025-05-07T20:33:10.1571214Z scale_ub=1200.0, 2025-05-07T20:33:10.1571296Z contiguous=False, 2025-05-07T20:33:10.1571385Z compiled=False, 2025-05-07T20:33:10.1571455Z ) 2025-05-07T20:33:10.1571669Z self = 2025-05-07T20:33:10.1571845Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.1571906Z 2025-05-07T20:33:10.1571981Z @given( 2025-05-07T20:33:10.1572103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1572203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1572356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1572477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1572588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1572660Z ) 2025-05-07T20:33:10.1572909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1573000Z def test_silu_mul_quant( 2025-05-07T20:33:10.1573076Z self, 2025-05-07T20:33:10.1573160Z T: int, 2025-05-07T20:33:10.1573236Z D: int, 2025-05-07T20:33:10.1573335Z scale_ub: Optional[float], 2025-05-07T20:33:10.1573434Z contiguous: bool, 2025-05-07T20:33:10.1573517Z compiled: bool, 2025-05-07T20:33:10.1573598Z ) -> None: 2025-05-07T20:33:10.1573691Z torch.manual_seed(2025) 2025-05-07T20:33:10.1573763Z 2025-05-07T20:33:10.1573933Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1574047Z 2025-05-07T20:33:10.1574138Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1574266Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1574353Z x = x_sign * x_clamp 2025-05-07T20:33:10.1574432Z x0 = x[:, :D] 2025-05-07T20:33:10.1574514Z x1 = x[:, D:] 2025-05-07T20:33:10.1574583Z 2025-05-07T20:33:10.1574665Z if contiguous: 2025-05-07T20:33:10.1574763Z x0 = x0.contiguous() 2025-05-07T20:33:10.1574849Z x1 = x1.contiguous() 2025-05-07T20:33:10.1574925Z 2025-05-07T20:33:10.1575014Z if scale_ub is not None: 2025-05-07T20:33:10.1575116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1575260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1575337Z ) 2025-05-07T20:33:10.1575412Z else: 2025-05-07T20:33:10.1575510Z scale_ub_tensor = None 2025-05-07T20:33:10.1575581Z 2025-05-07T20:33:10.1575710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1575807Z op = silu_mul_quant 2025-05-07T20:33:10.1575888Z if compiled: 2025-05-07T20:33:10.1575988Z op = torch.compile(op) 2025-05-07T20:33:10.1576097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1576167Z 2025-05-07T20:33:10.1576261Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1576265Z 2025-05-07T20:33:10.1576362Z moe/activation_test.py:117: 2025-05-07T20:33:10.1576487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1576596Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1576736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1577231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1577335Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1577690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1577922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1578259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1578350Z kernel = self.compile( 2025-05-07T20:33:10.1578735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1578910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1579038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1579050Z 2025-05-07T20:33:10.1579292Z self = 2025-05-07T20:33:10.1580071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1580643Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd6020>} 2025-05-07T20:33:10.1581384Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1581577Z context = 2025-05-07T20:33:10.1581582Z 2025-05-07T20:33:10.1581744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1582008Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1582118Z module_map=module_map) 2025-05-07T20:33:10.1582323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1582421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1582503Z E ^ 2025-05-07T20:33:10.1582850Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1582855Z 2025-05-07T20:33:10.1583269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1583274Z 2025-05-07T20:33:10.1583371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1583592Z self=, 2025-05-07T20:33:10.1583674Z T=16384, 2025-05-07T20:33:10.1583749Z D=5120, 2025-05-07T20:33:10.1583845Z scale_ub=1200.0, 2025-05-07T20:33:10.1583931Z contiguous=False, 2025-05-07T20:33:10.1584011Z compiled=True, 2025-05-07T20:33:10.1584087Z ) 2025-05-07T20:33:10.1584301Z self = 2025-05-07T20:33:10.1584481Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1584485Z 2025-05-07T20:33:10.1584568Z @given( 2025-05-07T20:33:10.1584683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1584777Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1584895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1585007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1585122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1585192Z ) 2025-05-07T20:33:10.1585483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1585572Z def test_silu_mul_quant( 2025-05-07T20:33:10.1585652Z self, 2025-05-07T20:33:10.1585728Z T: int, 2025-05-07T20:33:10.1585804Z D: int, 2025-05-07T20:33:10.1585905Z scale_ub: Optional[float], 2025-05-07T20:33:10.1585999Z contiguous: bool, 2025-05-07T20:33:10.1586083Z compiled: bool, 2025-05-07T20:33:10.1586168Z ) -> None: 2025-05-07T20:33:10.1586258Z torch.manual_seed(2025) 2025-05-07T20:33:10.1586336Z 2025-05-07T20:33:10.1586504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1586577Z 2025-05-07T20:33:10.1586672Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1586792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1586878Z x = x_sign * x_clamp 2025-05-07T20:33:10.1586961Z x0 = x[:, :D] 2025-05-07T20:33:10.1587040Z x1 = x[:, D:] 2025-05-07T20:33:10.1587110Z 2025-05-07T20:33:10.1587200Z if contiguous: 2025-05-07T20:33:10.1587334Z x0 = x0.contiguous() 2025-05-07T20:33:10.1587423Z x1 = x1.contiguous() 2025-05-07T20:33:10.1587500Z 2025-05-07T20:33:10.1587588Z if scale_ub is not None: 2025-05-07T20:33:10.1587701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1587878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1587959Z ) 2025-05-07T20:33:10.1588043Z else: 2025-05-07T20:33:10.1588136Z scale_ub_tensor = None 2025-05-07T20:33:10.1588207Z 2025-05-07T20:33:10.1588339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1588426Z op = silu_mul_quant 2025-05-07T20:33:10.1588508Z if compiled: 2025-05-07T20:33:10.1588613Z op = torch.compile(op) 2025-05-07T20:33:10.1588716Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1588789Z 2025-05-07T20:33:10.1588887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1588891Z 2025-05-07T20:33:10.1588986Z moe/activation_test.py:117: 2025-05-07T20:33:10.1589120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1589219Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1589366Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1589741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1589830Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1590319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1590418Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1590770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1590997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1591334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1591426Z kernel = self.compile( 2025-05-07T20:33:10.1591808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1591983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1592106Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1592116Z 2025-05-07T20:33:10.1592315Z self = 2025-05-07T20:33:10.1593084Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1593635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afbd7600>} 2025-05-07T20:33:10.1594378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1594576Z context = 2025-05-07T20:33:10.1594581Z 2025-05-07T20:33:10.1594741Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1595001Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1595119Z module_map=module_map) 2025-05-07T20:33:10.1595278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1595378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1595450Z E ^ 2025-05-07T20:33:10.1595845Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f16d00d4720>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
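The failing call chain above collapses into a short standalone reproduction. This is a hedged sketch assembled only from the frames shown: the import path, call signature, and shapes come from the traceback and the test body, and the silu/mul semantics are implied by the op name rather than verified here:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 2048, 7168  # one of the Hypothesis-sampled shape combinations above
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]

    # Per the test body, scale_ub may be None; the op returns (y_fp8, y_scale).
    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError, exactly as logged above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)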
Hypothesis then tries eleven more examples. Each one runs the same test body and fails at the same point, inside _fbgemm_silu_mul_quant, with the identical error:

    triton.compiler.errors.CompilationError: at 1:0 ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the sampled parameters differ across the attempts:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)

The compiled=True runs additionally pass through torch/_dynamo/eval_frame.py:678 (in _fn) before reaching silu_mul_quant; the failure point and error are otherwise identical, raised from triton/compiler/compiler.py:100 as CompilationError in every case.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1756105Z 2025-05-07T20:33:10.1756522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1756526Z 2025-05-07T20:33:10.1756621Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1756835Z self=, 2025-05-07T20:33:10.1756911Z T=4096, 2025-05-07T20:33:10.1756978Z D=7168, 2025-05-07T20:33:10.1757050Z scale_ub=None, 2025-05-07T20:33:10.1757137Z contiguous=False, 2025-05-07T20:33:10.1757211Z compiled=True, 2025-05-07T20:33:10.1757274Z ) 2025-05-07T20:33:10.1757497Z self = 2025-05-07T20:33:10.1757667Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1757671Z 2025-05-07T20:33:10.1757751Z @given( 2025-05-07T20:33:10.1757866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1757961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1758078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1758188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1758293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1758365Z ) 2025-05-07T20:33:10.1758604Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1758694Z def test_silu_mul_quant( 2025-05-07T20:33:10.1758762Z self, 2025-05-07T20:33:10.1758831Z T: int, 2025-05-07T20:33:10.1758907Z D: int, 2025-05-07T20:33:10.1759044Z scale_ub: Optional[float], 2025-05-07T20:33:10.1759125Z contiguous: bool, 2025-05-07T20:33:10.1759213Z compiled: bool, 2025-05-07T20:33:10.1759284Z ) -> None: 2025-05-07T20:33:10.1759370Z torch.manual_seed(2025) 2025-05-07T20:33:10.1759444Z 2025-05-07T20:33:10.1759609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1759676Z 2025-05-07T20:33:10.1759768Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1759887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1759976Z x = x_sign * x_clamp 2025-05-07T20:33:10.1760047Z x0 = x[:, :D] 2025-05-07T20:33:10.1760121Z x1 = x[:, D:] 2025-05-07T20:33:10.1760193Z 2025-05-07T20:33:10.1760268Z if contiguous: 2025-05-07T20:33:10.1760352Z x0 = x0.contiguous() 2025-05-07T20:33:10.1760441Z x1 = x1.contiguous() 2025-05-07T20:33:10.1760505Z 2025-05-07T20:33:10.1760590Z if scale_ub is not None: 2025-05-07T20:33:10.1760694Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1760874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1760943Z ) 2025-05-07T20:33:10.1761015Z else: 2025-05-07T20:33:10.1761103Z scale_ub_tensor = None 2025-05-07T20:33:10.1761205Z 2025-05-07T20:33:10.1761334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1761415Z op = silu_mul_quant 2025-05-07T20:33:10.1761499Z if compiled: 2025-05-07T20:33:10.1761591Z op = torch.compile(op) 2025-05-07T20:33:10.1761688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1761757Z 2025-05-07T20:33:10.1761839Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1761843Z 2025-05-07T20:33:10.1761932Z moe/activation_test.py:117: 2025-05-07T20:33:10.1762063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1762158Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1762253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1762619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1762702Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1763237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1763327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1763679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1763900Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1764230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1764445Z kernel = self.compile( 2025-05-07T20:33:10.1764824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1764995Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1765123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1765133Z 2025-05-07T20:33:10.1765330Z self = 2025-05-07T20:33:10.1766099Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1766601Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc78d60>} 2025-05-07T20:33:10.1767412Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1767604Z context = 2025-05-07T20:33:10.1767609Z 2025-05-07T20:33:10.1767769Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1768034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1768134Z module_map=module_map) 2025-05-07T20:33:10.1768289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1768385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1768453Z E ^ 2025-05-07T20:33:10.1768799Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1768803Z 2025-05-07T20:33:10.1773614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1773698Z 2025-05-07T20:33:10.1773829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1774058Z self=, 2025-05-07T20:33:10.1774150Z T=16384, 2025-05-07T20:33:10.1774268Z D=5120, 2025-05-07T20:33:10.1774353Z scale_ub=1200.0, 2025-05-07T20:33:10.1774449Z contiguous=False, 2025-05-07T20:33:10.1774534Z compiled=False, 2025-05-07T20:33:10.1774610Z ) 2025-05-07T20:33:10.1774837Z self = 2025-05-07T20:33:10.1775019Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.1775024Z 2025-05-07T20:33:10.1775101Z @given( 2025-05-07T20:33:10.1775231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1775330Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1775448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1775574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1775687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1775775Z ) 2025-05-07T20:33:10.1776020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1776161Z def test_silu_mul_quant( 2025-05-07T20:33:10.1776248Z self, 2025-05-07T20:33:10.1776327Z T: int, 2025-05-07T20:33:10.1776405Z D: int, 2025-05-07T20:33:10.1776513Z scale_ub: Optional[float], 2025-05-07T20:33:10.1776603Z contiguous: bool, 2025-05-07T20:33:10.1776689Z compiled: bool, 2025-05-07T20:33:10.1776778Z ) -> None: 2025-05-07T20:33:10.1776873Z torch.manual_seed(2025) 2025-05-07T20:33:10.1776949Z 2025-05-07T20:33:10.1777122Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1777197Z 2025-05-07T20:33:10.1777304Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1777428Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1777519Z x = x_sign * x_clamp 2025-05-07T20:33:10.1777612Z x0 = x[:, :D] 2025-05-07T20:33:10.1777696Z x1 = x[:, D:] 2025-05-07T20:33:10.1777770Z 2025-05-07T20:33:10.1777859Z if contiguous: 2025-05-07T20:33:10.1777955Z x0 = x0.contiguous() 2025-05-07T20:33:10.1778044Z x1 = x1.contiguous() 2025-05-07T20:33:10.1778125Z 2025-05-07T20:33:10.1778217Z if scale_ub is not None: 2025-05-07T20:33:10.1778325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1778471Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1778549Z ) 2025-05-07T20:33:10.1778634Z else: 2025-05-07T20:33:10.1778727Z scale_ub_tensor = None 2025-05-07T20:33:10.1778799Z 2025-05-07T20:33:10.1778932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1779073Z op = silu_mul_quant 2025-05-07T20:33:10.1779161Z if compiled: 2025-05-07T20:33:10.1779271Z op = torch.compile(op) 2025-05-07T20:33:10.1779378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1779453Z 2025-05-07T20:33:10.1779555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1779562Z 2025-05-07T20:33:10.1779657Z moe/activation_test.py:117: 2025-05-07T20:33:10.1779786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1779894Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1779994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1780498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.1780593Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1780951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1781223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1781560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1781666Z kernel = self.compile( 2025-05-07T20:33:10.1782086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1782260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1782393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1782397Z 2025-05-07T20:33:10.1782601Z self = 2025-05-07T20:33:10.1783392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1783893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc79c60>} 2025-05-07T20:33:10.1784675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1784874Z context = 2025-05-07T20:33:10.1784878Z 2025-05-07T20:33:10.1785041Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1785309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1785419Z module_map=module_map) 2025-05-07T20:33:10.1785579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1785685Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1785762Z E ^ 2025-05-07T20:33:10.1786118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1786130Z 2025-05-07T20:33:10.1786542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1786549Z 2025-05-07T20:33:10.1786651Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1786877Z self=, 2025-05-07T20:33:10.1786953Z T=16384, 2025-05-07T20:33:10.1787029Z D=5120, 2025-05-07T20:33:10.1787121Z scale_ub=1200.0, 2025-05-07T20:33:10.1787206Z contiguous=True, 2025-05-07T20:33:10.1787287Z compiled=True, 2025-05-07T20:33:10.1787369Z ) 2025-05-07T20:33:10.1787587Z self = 2025-05-07T20:33:10.1787810Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1787817Z 2025-05-07T20:33:10.1787893Z @given( 2025-05-07T20:33:10.1788012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1788121Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1788241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1788361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1788480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1788556Z ) 2025-05-07T20:33:10.1788802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1788903Z def test_silu_mul_quant( 2025-05-07T20:33:10.1788980Z self, 2025-05-07T20:33:10.1789070Z T: int, 2025-05-07T20:33:10.1789151Z D: int, 2025-05-07T20:33:10.1789249Z scale_ub: Optional[float], 2025-05-07T20:33:10.1789349Z contiguous: bool, 2025-05-07T20:33:10.1789439Z compiled: bool, 2025-05-07T20:33:10.1789517Z ) -> None: 2025-05-07T20:33:10.1789663Z torch.manual_seed(2025) 2025-05-07T20:33:10.1789740Z 2025-05-07T20:33:10.1789906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1789992Z 2025-05-07T20:33:10.1790121Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1790245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1790342Z x = x_sign * x_clamp 2025-05-07T20:33:10.1790424Z x0 = x[:, :D] 2025-05-07T20:33:10.1790514Z x1 = x[:, D:] 2025-05-07T20:33:10.1790585Z 2025-05-07T20:33:10.1790670Z if contiguous: 2025-05-07T20:33:10.1790768Z x0 = x0.contiguous() 2025-05-07T20:33:10.1790857Z x1 = x1.contiguous() 2025-05-07T20:33:10.1790931Z 2025-05-07T20:33:10.1791029Z if scale_ub is not None: 2025-05-07T20:33:10.1791134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1791272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1791357Z ) 2025-05-07T20:33:10.1791437Z else: 2025-05-07T20:33:10.1791530Z scale_ub_tensor = None 2025-05-07T20:33:10.1791609Z 2025-05-07T20:33:10.1791739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1791878Z op = silu_mul_quant 2025-05-07T20:33:10.1791970Z if compiled: 2025-05-07T20:33:10.1792067Z op = torch.compile(op) 2025-05-07T20:33:10.1792172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1792252Z 2025-05-07T20:33:10.1792348Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1792353Z 2025-05-07T20:33:10.1792465Z moe/activation_test.py:117: 2025-05-07T20:33:10.1792600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1792706Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1792829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1793238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1793332Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1793832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1793934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1794289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1794515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1794852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1794954Z kernel = self.compile( 2025-05-07T20:33:10.1795335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1795555Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1795689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1795693Z 2025-05-07T20:33:10.1795895Z self = 2025-05-07T20:33:10.1796688Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1797190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15afc7b380>} 2025-05-07T20:33:10.1797936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1798170Z context = 2025-05-07T20:33:10.1798175Z 2025-05-07T20:33:10.1798339Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1798608Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1798753Z module_map=module_map) 2025-05-07T20:33:10.1798912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1799014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1799089Z E ^ 2025-05-07T20:33:10.1799441Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1799452Z 2025-05-07T20:33:10.1799865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1799870Z 2025-05-07T20:33:10.1799978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1800210Z self=, 2025-05-07T20:33:10.1800286Z T=16384, 2025-05-07T20:33:10.1800362Z D=5120, 2025-05-07T20:33:10.1800449Z scale_ub=None, 2025-05-07T20:33:10.1800614Z contiguous=False, 2025-05-07T20:33:10.1800699Z compiled=True, 2025-05-07T20:33:10.1800776Z ) 2025-05-07T20:33:10.1800992Z self = 2025-05-07T20:33:10.1801174Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1801179Z 2025-05-07T20:33:10.1801251Z @given( 2025-05-07T20:33:10.1801367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1801468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1801580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1801693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1801811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1801883Z ) 2025-05-07T20:33:10.1802129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1802225Z def test_silu_mul_quant( 2025-05-07T20:33:10.1802306Z self, 2025-05-07T20:33:10.1802394Z T: int, 2025-05-07T20:33:10.1802471Z D: int, 2025-05-07T20:33:10.1802578Z scale_ub: Optional[float], 2025-05-07T20:33:10.1802685Z contiguous: bool, 2025-05-07T20:33:10.1802783Z compiled: bool, 2025-05-07T20:33:10.1802868Z ) -> None: 2025-05-07T20:33:10.1802965Z torch.manual_seed(2025) 2025-05-07T20:33:10.1803038Z 2025-05-07T20:33:10.1803204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1803284Z 2025-05-07T20:33:10.1803373Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1803498Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1803641Z x = x_sign * x_clamp 2025-05-07T20:33:10.1803721Z x0 = x[:, :D] 2025-05-07T20:33:10.1803808Z x1 = x[:, D:] 2025-05-07T20:33:10.1803880Z 2025-05-07T20:33:10.1803958Z if contiguous: 2025-05-07T20:33:10.1804053Z x0 = x0.contiguous() 2025-05-07T20:33:10.1804143Z x1 = x1.contiguous() 2025-05-07T20:33:10.1804219Z 2025-05-07T20:33:10.1804473Z if scale_ub is not None: 2025-05-07T20:33:10.1804577Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1804709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1804788Z ) 2025-05-07T20:33:10.1804864Z else: 2025-05-07T20:33:10.1804954Z scale_ub_tensor = None 2025-05-07T20:33:10.1805028Z 2025-05-07T20:33:10.1805153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1805249Z op = silu_mul_quant 2025-05-07T20:33:10.1805333Z if compiled: 2025-05-07T20:33:10.1805435Z op = torch.compile(op) 2025-05-07T20:33:10.1805591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1805663Z 2025-05-07T20:33:10.1805750Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1805754Z 2025-05-07T20:33:10.1805853Z moe/activation_test.py:117: 2025-05-07T20:33:10.1805982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1806179Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1806280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1806645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1806741Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1807230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1807325Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1807692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1807915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1808512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1808844Z kernel = self.compile( 2025-05-07T20:33:10.1811346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1811561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1811684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1811690Z 2025-05-07T20:33:10.1811893Z self = 2025-05-07T20:33:10.1812710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1813218Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af810180>} 2025-05-07T20:33:10.1813968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1814159Z context = 2025-05-07T20:33:10.1814164Z 2025-05-07T20:33:10.1814330Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1814587Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1814688Z module_map=module_map) 2025-05-07T20:33:10.1815064Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1815162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1815229Z E ^ 2025-05-07T20:33:10.1815586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1815597Z 2025-05-07T20:33:10.1816002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1816006Z 2025-05-07T20:33:10.1816107Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1816322Z self=, 2025-05-07T20:33:10.1816390Z T=2048, 2025-05-07T20:33:10.1816463Z D=5120, 2025-05-07T20:33:10.1816536Z scale_ub=None, 2025-05-07T20:33:10.1816616Z contiguous=False, 2025-05-07T20:33:10.1816697Z compiled=True, 2025-05-07T20:33:10.1816764Z ) 2025-05-07T20:33:10.1816979Z self = 2025-05-07T20:33:10.1817236Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.1817241Z 2025-05-07T20:33:10.1817310Z @given( 2025-05-07T20:33:10.1817429Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1817591Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1817701Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1817814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1817919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1817986Z ) 2025-05-07T20:33:10.1818231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1818316Z def test_silu_mul_quant( 2025-05-07T20:33:10.1818384Z self, 2025-05-07T20:33:10.1818459Z T: int, 2025-05-07T20:33:10.1818528Z D: int, 2025-05-07T20:33:10.1818623Z scale_ub: Optional[float], 2025-05-07T20:33:10.1818709Z contiguous: bool, 2025-05-07T20:33:10.1818784Z compiled: bool, 2025-05-07T20:33:10.1818865Z ) -> None: 2025-05-07T20:33:10.1818950Z torch.manual_seed(2025) 2025-05-07T20:33:10.1819013Z 2025-05-07T20:33:10.1819180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1819296Z 2025-05-07T20:33:10.1819379Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1819504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1819583Z x = x_sign * x_clamp 2025-05-07T20:33:10.1819653Z x0 = x[:, :D] 2025-05-07T20:33:10.1819732Z x1 = x[:, D:] 2025-05-07T20:33:10.1819794Z 2025-05-07T20:33:10.1819875Z if contiguous: 2025-05-07T20:33:10.1819957Z x0 = x0.contiguous() 2025-05-07T20:33:10.1820037Z x1 = x1.contiguous() 2025-05-07T20:33:10.1820106Z 2025-05-07T20:33:10.1820191Z if scale_ub is not None: 2025-05-07T20:33:10.1820289Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1820427Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1820493Z ) 2025-05-07T20:33:10.1820560Z else: 2025-05-07T20:33:10.1820652Z scale_ub_tensor = None 2025-05-07T20:33:10.1820718Z 2025-05-07T20:33:10.1820842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1820930Z op = silu_mul_quant 2025-05-07T20:33:10.1821005Z if compiled: 2025-05-07T20:33:10.1821104Z op = torch.compile(op) 2025-05-07T20:33:10.1821202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1821265Z 2025-05-07T20:33:10.1821354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1821359Z 2025-05-07T20:33:10.1821447Z moe/activation_test.py:117: 2025-05-07T20:33:10.1821569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1821667Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1821806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1822170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1822260Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1822746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1822842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1823189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1823406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1823740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1823825Z kernel = self.compile( 2025-05-07T20:33:10.1824200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1824416Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1824539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1824547Z 2025-05-07T20:33:10.1824750Z self = 2025-05-07T20:33:10.1825554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1826061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af811440>} 2025-05-07T20:33:10.1826800Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1826986Z context = 2025-05-07T20:33:10.1826991Z 2025-05-07T20:33:10.1827154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1827449Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1827555Z module_map=module_map) 2025-05-07T20:33:10.1827709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1827799Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1827875Z E ^ 2025-05-07T20:33:10.1828221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1828226Z 2025-05-07T20:33:10.1828631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1828642Z 2025-05-07T20:33:10.1828739Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1828955Z self=, 2025-05-07T20:33:10.1829034Z T=2048, 2025-05-07T20:33:10.1829104Z D=5120, 2025-05-07T20:33:10.1829181Z scale_ub=1200.0, 2025-05-07T20:33:10.1829268Z contiguous=False, 2025-05-07T20:33:10.1829341Z compiled=True, 2025-05-07T20:33:10.1829409Z ) 2025-05-07T20:33:10.1829626Z self = 2025-05-07T20:33:10.1829794Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1829799Z 2025-05-07T20:33:10.1829873Z @given( 2025-05-07T20:33:10.1829983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1830074Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1830188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1830342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1830451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1830524Z ) 2025-05-07T20:33:10.1830762Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1830852Z def test_silu_mul_quant( 2025-05-07T20:33:10.1830928Z self, 2025-05-07T20:33:10.1830999Z T: int, 2025-05-07T20:33:10.1831067Z D: int, 2025-05-07T20:33:10.1831166Z scale_ub: Optional[float], 2025-05-07T20:33:10.1831246Z contiguous: bool, 2025-05-07T20:33:10.1831333Z compiled: bool, 2025-05-07T20:33:10.1831404Z ) -> None: 2025-05-07T20:33:10.1831490Z torch.manual_seed(2025) 2025-05-07T20:33:10.1831564Z 2025-05-07T20:33:10.1831726Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1831790Z 2025-05-07T20:33:10.1831880Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1831999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1832078Z x = x_sign * x_clamp 2025-05-07T20:33:10.1832203Z x0 = x[:, :D] 2025-05-07T20:33:10.1832274Z x1 = x[:, D:] 2025-05-07T20:33:10.1832336Z 2025-05-07T20:33:10.1832416Z if contiguous: 2025-05-07T20:33:10.1832501Z x0 = x0.contiguous() 2025-05-07T20:33:10.1832627Z x1 = x1.contiguous() 2025-05-07T20:33:10.1832690Z 2025-05-07T20:33:10.1832771Z if scale_ub is not None: 2025-05-07T20:33:10.1832874Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1833002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1833069Z ) 2025-05-07T20:33:10.1833142Z else: 2025-05-07T20:33:10.1833241Z scale_ub_tensor = None 2025-05-07T20:33:10.1833304Z 2025-05-07T20:33:10.1833425Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1833512Z op = silu_mul_quant 2025-05-07T20:33:10.1833590Z if compiled: 2025-05-07T20:33:10.1833689Z op = torch.compile(op) 2025-05-07T20:33:10.1833789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1833853Z 2025-05-07T20:33:10.1833943Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1833992Z 2025-05-07T20:33:10.1834082Z moe/activation_test.py:117: 2025-05-07T20:33:10.1834203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1834305Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1834397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1834760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1834845Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1835327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1835424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1835775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1835990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1836329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1836420Z kernel = self.compile( 2025-05-07T20:33:10.1836799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1836966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1837085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1837090Z 2025-05-07T20:33:10.1837294Z self = 2025-05-07T20:33:10.1838108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1838611Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af812660>} 2025-05-07T20:33:10.1839351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1839536Z context = 2025-05-07T20:33:10.1839549Z 2025-05-07T20:33:10.1839707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1839964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1840074Z module_map=module_map) 2025-05-07T20:33:10.1840270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1840362Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1840438Z E ^ 2025-05-07T20:33:10.1840783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1840861Z 2025-05-07T20:33:10.1841272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1841276Z 2025-05-07T20:33:10.1841370Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1841584Z self=, 2025-05-07T20:33:10.1841656Z T=4096, 2025-05-07T20:33:10.1841722Z D=5120, 2025-05-07T20:33:10.1841795Z scale_ub=1200.0, 2025-05-07T20:33:10.1841877Z contiguous=True, 2025-05-07T20:33:10.1841951Z compiled=True, 2025-05-07T20:33:10.1842017Z ) 2025-05-07T20:33:10.1842236Z self = 2025-05-07T20:33:10.1842401Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1842405Z 2025-05-07T20:33:10.1842529Z @given( 2025-05-07T20:33:10.1842639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1842732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1842847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1842957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1843064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1843135Z ) 2025-05-07T20:33:10.1843374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1843458Z def test_silu_mul_quant( 2025-05-07T20:33:10.1843532Z self, 2025-05-07T20:33:10.1843600Z T: int, 2025-05-07T20:33:10.1843676Z D: int, 2025-05-07T20:33:10.1843765Z scale_ub: Optional[float], 2025-05-07T20:33:10.1843848Z contiguous: bool, 2025-05-07T20:33:10.1843932Z compiled: bool, 2025-05-07T20:33:10.1844000Z ) -> None: 2025-05-07T20:33:10.1844084Z torch.manual_seed(2025) 2025-05-07T20:33:10.1844161Z 2025-05-07T20:33:10.1844510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1844573Z 2025-05-07T20:33:10.1844665Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1844781Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1844862Z x = x_sign * x_clamp 2025-05-07T20:33:10.1844939Z x0 = x[:, :D] 2025-05-07T20:33:10.1845009Z x1 = x[:, D:] 2025-05-07T20:33:10.1845078Z 2025-05-07T20:33:10.1845152Z if contiguous: 2025-05-07T20:33:10.1845236Z x0 = x0.contiguous() 2025-05-07T20:33:10.1845324Z x1 = x1.contiguous() 2025-05-07T20:33:10.1845386Z 2025-05-07T20:33:10.1845524Z if scale_ub is not None: 2025-05-07T20:33:10.1845632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1845759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1845825Z ) 2025-05-07T20:33:10.1845900Z else: 2025-05-07T20:33:10.1845988Z scale_ub_tensor = None 2025-05-07T20:33:10.1846055Z 2025-05-07T20:33:10.1846184Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1846265Z op = silu_mul_quant 2025-05-07T20:33:10.1846342Z if compiled: 2025-05-07T20:33:10.1846442Z op = torch.compile(op) 2025-05-07T20:33:10.1846541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1846614Z 2025-05-07T20:33:10.1846700Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1846704Z 2025-05-07T20:33:10.1846792Z moe/activation_test.py:117: 2025-05-07T20:33:10.1846921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1847018Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1847159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1847526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1847615Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1848145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1848235Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1848584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1848810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1849138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1849230Z kernel = self.compile( 2025-05-07T20:33:10.1849615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1849782Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1849913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1849961Z 2025-05-07T20:33:10.1850159Z self = 2025-05-07T20:33:10.1850925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1851425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af8139c0>} 2025-05-07T20:33:10.1852163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1852354Z context = 2025-05-07T20:33:10.1852362Z 2025-05-07T20:33:10.1852518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1852780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1852881Z module_map=module_map) 2025-05-07T20:33:10.1853035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1853133Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1853201Z E ^ 2025-05-07T20:33:10.1853548Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1853553Z 2025-05-07T20:33:10.1854006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1854013Z 2025-05-07T20:33:10.1854108Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1854331Z self=, 2025-05-07T20:33:10.1854403Z T=128, 2025-05-07T20:33:10.1854471Z D=5120, 2025-05-07T20:33:10.1854553Z scale_ub=1200.0, 2025-05-07T20:33:10.1854629Z contiguous=False, 2025-05-07T20:33:10.1854703Z compiled=True, 2025-05-07T20:33:10.1854775Z ) 2025-05-07T20:33:10.1854983Z self = 2025-05-07T20:33:10.1855148Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.1855161Z 2025-05-07T20:33:10.1855227Z @given( 2025-05-07T20:33:10.1855336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1855436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1855545Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1855697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1855810Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1855874Z ) 2025-05-07T20:33:10.1856112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1856247Z def test_silu_mul_quant( 2025-05-07T20:33:10.1856314Z self, 2025-05-07T20:33:10.1856381Z T: int, 2025-05-07T20:33:10.1856457Z D: int, 2025-05-07T20:33:10.1856548Z scale_ub: Optional[float], 2025-05-07T20:33:10.1856640Z contiguous: bool, 2025-05-07T20:33:10.1856716Z compiled: bool, 2025-05-07T20:33:10.1856785Z ) -> None: 2025-05-07T20:33:10.1856882Z torch.manual_seed(2025) 2025-05-07T20:33:10.1856945Z 2025-05-07T20:33:10.1857107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1857179Z 2025-05-07T20:33:10.1857265Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1857384Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1857471Z x = x_sign * x_clamp 2025-05-07T20:33:10.1857541Z x0 = x[:, :D] 2025-05-07T20:33:10.1857611Z x1 = x[:, D:] 2025-05-07T20:33:10.1857725Z 2025-05-07T20:33:10.1857801Z if contiguous: 2025-05-07T20:33:10.1857885Z x0 = x0.contiguous() 2025-05-07T20:33:10.1857979Z x1 = x1.contiguous() 2025-05-07T20:33:10.1858043Z 2025-05-07T20:33:10.1858132Z if scale_ub is not None: 2025-05-07T20:33:10.1858230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1858359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1858435Z ) 2025-05-07T20:33:10.1858505Z else: 2025-05-07T20:33:10.1858589Z scale_ub_tensor = None 2025-05-07T20:33:10.1858662Z 2025-05-07T20:33:10.1858783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1858865Z op = silu_mul_quant 2025-05-07T20:33:10.1858950Z if compiled: 2025-05-07T20:33:10.1859042Z op = torch.compile(op) 2025-05-07T20:33:10.1859138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1859209Z 2025-05-07T20:33:10.1859294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1859301Z 2025-05-07T20:33:10.1859395Z moe/activation_test.py:117: 2025-05-07T20:33:10.1859519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1859611Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1859709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1860067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1860150Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1860684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1860774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1861132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1861347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1861685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1861775Z kernel = self.compile( 2025-05-07T20:33:10.1862147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1862322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1862443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1862447Z 2025-05-07T20:33:10.1862643Z self = 2025-05-07T20:33:10.1863462Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1863959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43cfe0>} 2025-05-07T20:33:10.1864736Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1864920Z context = 2025-05-07T20:33:10.1864925Z 2025-05-07T20:33:10.1865080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1865349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1865452Z module_map=module_map) 2025-05-07T20:33:10.1865612Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1865703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1865811Z E ^ 2025-05-07T20:33:10.1866166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1866170Z 2025-05-07T20:33:10.1866575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1866579Z 2025-05-07T20:33:10.1866679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1866893Z self=, 2025-05-07T20:33:10.1866961Z T=16384, 2025-05-07T20:33:10.1867033Z D=7168, 2025-05-07T20:33:10.1867108Z scale_ub=1200.0, 2025-05-07T20:33:10.1867186Z contiguous=True, 2025-05-07T20:33:10.1867266Z compiled=True, 2025-05-07T20:33:10.1867330Z ) 2025-05-07T20:33:10.1867544Z self = 2025-05-07T20:33:10.1867719Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1867726Z 2025-05-07T20:33:10.1867795Z @given( 2025-05-07T20:33:10.1867914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1868006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1868114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1868232Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1868337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1868403Z ) 2025-05-07T20:33:10.1868649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1868733Z def test_silu_mul_quant( 2025-05-07T20:33:10.1868799Z self, 2025-05-07T20:33:10.1868922Z T: int, 2025-05-07T20:33:10.1868994Z D: int, 2025-05-07T20:33:10.1869086Z scale_ub: Optional[float], 2025-05-07T20:33:10.1869174Z contiguous: bool, 2025-05-07T20:33:10.1869250Z compiled: bool, 2025-05-07T20:33:10.1869326Z ) -> None: 2025-05-07T20:33:10.1869414Z torch.manual_seed(2025) 2025-05-07T20:33:10.1869479Z 2025-05-07T20:33:10.1869647Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1869715Z 2025-05-07T20:33:10.1869797Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1869920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1869999Z x = x_sign * x_clamp 2025-05-07T20:33:10.1870070Z x0 = x[:, :D] 2025-05-07T20:33:10.1870149Z x1 = x[:, D:] 2025-05-07T20:33:10.1870213Z 2025-05-07T20:33:10.1870289Z if contiguous: 2025-05-07T20:33:10.1870381Z x0 = x0.contiguous() 2025-05-07T20:33:10.1870465Z x1 = x1.contiguous() 2025-05-07T20:33:10.1870528Z 2025-05-07T20:33:10.1870691Z if scale_ub is not None: 2025-05-07T20:33:10.1870789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1870924Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1870995Z ) 2025-05-07T20:33:10.1871101Z else: 2025-05-07T20:33:10.1871196Z scale_ub_tensor = None 2025-05-07T20:33:10.1871259Z 2025-05-07T20:33:10.1871381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1871468Z op = silu_mul_quant 2025-05-07T20:33:10.1871544Z if compiled: 2025-05-07T20:33:10.1871636Z op = torch.compile(op) 2025-05-07T20:33:10.1871742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1871805Z 2025-05-07T20:33:10.1871888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1871900Z 2025-05-07T20:33:10.1871987Z moe/activation_test.py:117: 2025-05-07T20:33:10.1872114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1872216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1872306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1872663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.1872796Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.1873277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1873367Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1873721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1873936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1874274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1874360Z kernel = self.compile( 2025-05-07T20:33:10.1874738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1874914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1875035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1875041Z 2025-05-07T20:33:10.1875246Z self = 2025-05-07T20:33:10.1876014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1876508Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af43de40>} 2025-05-07T20:33:10.1877756Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1877951Z context = 2025-05-07T20:33:10.1877962Z 2025-05-07T20:33:10.1878130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1878385Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1878485Z module_map=module_map) 2025-05-07T20:33:10.1878646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1878739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1878815Z E ^ 2025-05-07T20:33:10.1879162Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
[traceback from activation.py:80 onward identical to the one above, ending in the same CompilationError]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The next four examples fail with the identical CompilationError (runs with compiled=True merely add a torch/_dynamo/eval_frame.py:678 frame before reaching the Triton launch):

    T=1,    D=7168, scale_ub=1200.0, contiguous=False, compiled=False  -> CompilationError (fp8e4nv)
    T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> CompilationError (fp8e4nv)
    T=128,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True   -> CompilationError (fp8e4nv)
    T=2048, D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
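The OutOfMemoryError examples are consistent with the device simply being full by this point in the run: each example allocates x of shape [T, 2 * D] in bfloat16, so T=16384 with D=5120 is 16384 * 10240 * 2 bytes = 320 MiB, exactly the failed 320.00 MiB request above, on a card whose 22.07 GiB is already almost entirely in use. Two mitigations, sketched under the assumption that explicit cleanup between Hypothesis examples is acceptable; the environment variable is the one the error message itself recommends, and the helper name is illustrative:

    import gc
    import os

    # Must be set before the first CUDA allocation in the process;
    # this is the allocator hint the OOM message suggests.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Illustrative per-example cleanup: collect dead references, then
        # return cached blocks to the driver so one example's
        # multi-hundred-MiB tensors cannot starve the next example.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() at the end of each example (or in the test's tearDown) would keep the allocation high-water mark bounded by a single example instead of the whole shrink sequence.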
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1954344Z 2025-05-07T20:33:10.1954460Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1954465Z 2025-05-07T20:33:10.1954567Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1954787Z self=, 2025-05-07T20:33:10.1954862Z T=4096, 2025-05-07T20:33:10.1954940Z D=7168, 2025-05-07T20:33:10.1955021Z scale_ub=1200.0, 2025-05-07T20:33:10.1955103Z contiguous=True, 2025-05-07T20:33:10.1955189Z compiled=True, 2025-05-07T20:33:10.1955257Z ) 2025-05-07T20:33:10.1955468Z self = 2025-05-07T20:33:10.1955642Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1955646Z 2025-05-07T20:33:10.1955717Z @given( 2025-05-07T20:33:10.1955828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1955934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1956086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1956206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1956314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1956389Z ) 2025-05-07T20:33:10.1956674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1956761Z def test_silu_mul_quant( 2025-05-07T20:33:10.1956833Z self, 2025-05-07T20:33:10.1956912Z T: int, 2025-05-07T20:33:10.1956984Z D: int, 2025-05-07T20:33:10.1957093Z scale_ub: Optional[float], 2025-05-07T20:33:10.1957183Z contiguous: bool, 2025-05-07T20:33:10.1957266Z compiled: bool, 2025-05-07T20:33:10.1957339Z ) -> None: 2025-05-07T20:33:10.1957436Z torch.manual_seed(2025) 2025-05-07T20:33:10.1957508Z 2025-05-07T20:33:10.1957679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1957757Z 2025-05-07T20:33:10.1957850Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1957972Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1959748Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1959799Z 2025-05-07T20:33:10.1959916Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1959927Z 2025-05-07T20:33:10.1960024Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1960245Z self=, 2025-05-07T20:33:10.1960332Z T=16384, 2025-05-07T20:33:10.1960406Z D=7168, 2025-05-07T20:33:10.1960482Z scale_ub=None, 2025-05-07T20:33:10.1960570Z contiguous=False, 2025-05-07T20:33:10.1960651Z compiled=False, 2025-05-07T20:33:10.1960723Z ) 2025-05-07T20:33:10.1960942Z self = 2025-05-07T20:33:10.1961115Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.1961119Z 2025-05-07T20:33:10.1961193Z @given( 2025-05-07T20:33:10.1961315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1961411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1961530Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1961643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1961751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1961874Z ) 2025-05-07T20:33:10.1962120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1962208Z def test_silu_mul_quant( 2025-05-07T20:33:10.1962290Z self, 2025-05-07T20:33:10.1962363Z T: int, 2025-05-07T20:33:10.1962441Z D: int, 2025-05-07T20:33:10.1962547Z scale_ub: Optional[float], 2025-05-07T20:33:10.1962632Z contiguous: bool, 2025-05-07T20:33:10.1962720Z compiled: bool, 2025-05-07T20:33:10.1962793Z ) -> None: 2025-05-07T20:33:10.1962882Z torch.manual_seed(2025) 2025-05-07T20:33:10.1962958Z 2025-05-07T20:33:10.1963121Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1965057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1965137Z 2025-05-07T20:33:10.1965254Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.1965258Z 2025-05-07T20:33:10.1965354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1965575Z self=, 2025-05-07T20:33:10.1965647Z T=2048, 2025-05-07T20:33:10.1965720Z D=7168, 2025-05-07T20:33:10.1965802Z scale_ub=1200.0, 2025-05-07T20:33:10.1965881Z contiguous=True, 2025-05-07T20:33:10.1965958Z compiled=True, 2025-05-07T20:33:10.1966033Z ) 2025-05-07T20:33:10.1966244Z self = 2025-05-07T20:33:10.1966419Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.1966423Z 2025-05-07T20:33:10.1966495Z @given( 2025-05-07T20:33:10.1966607Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1966705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1966862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1966974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1967088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1967157Z ) 2025-05-07T20:33:10.1967407Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1967496Z def test_silu_mul_quant( 2025-05-07T20:33:10.1967570Z self, 2025-05-07T20:33:10.1967648Z T: int, 2025-05-07T20:33:10.1967722Z D: int, 2025-05-07T20:33:10.1967815Z scale_ub: Optional[float], 2025-05-07T20:33:10.1967904Z contiguous: bool, 2025-05-07T20:33:10.1967988Z compiled: bool, 2025-05-07T20:33:10.1968061Z ) -> None: 2025-05-07T20:33:10.1968159Z torch.manual_seed(2025) 2025-05-07T20:33:10.1968228Z 2025-05-07T20:33:10.1968391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1968468Z 2025-05-07T20:33:10.1968558Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1968677Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1970488Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1970494Z 2025-05-07T20:33:10.1970620Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.1970624Z 2025-05-07T20:33:10.1970722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1970937Z self=, 2025-05-07T20:33:10.1971021Z T=2048, 2025-05-07T20:33:10.1971090Z D=7168, 2025-05-07T20:33:10.1971170Z scale_ub=None, 2025-05-07T20:33:10.1971255Z contiguous=True, 2025-05-07T20:33:10.1971338Z compiled=False, 2025-05-07T20:33:10.1971407Z ) 2025-05-07T20:33:10.1971626Z self = 2025-05-07T20:33:10.1971793Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.1971798Z 2025-05-07T20:33:10.1971878Z @given( 2025-05-07T20:33:10.1971991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1972088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1972205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1972362Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1972471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1972552Z ) 2025-05-07T20:33:10.1972795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1972922Z def test_silu_mul_quant( 2025-05-07T20:33:10.1973006Z self, 2025-05-07T20:33:10.1973079Z T: int, 2025-05-07T20:33:10.1973160Z D: int, 2025-05-07T20:33:10.1973253Z scale_ub: Optional[float], 2025-05-07T20:33:10.1973336Z contiguous: bool, 2025-05-07T20:33:10.1973425Z compiled: bool, 2025-05-07T20:33:10.1973500Z ) -> None: 2025-05-07T20:33:10.1973591Z torch.manual_seed(2025) 2025-05-07T20:33:10.1973669Z 2025-05-07T20:33:10.1973832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1973903Z 2025-05-07T20:33:10.1973999Z > x_sign = torch.sign(x) 2025-05-07T20:33:10.1975759Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.1975813Z 2025-05-07T20:33:10.1975935Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:10.1975939Z 2025-05-07T20:33:10.1976038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1976254Z self=, 2025-05-07T20:33:10.1976335Z T=1, 2025-05-07T20:33:10.1976412Z D=7168, 2025-05-07T20:33:10.1976497Z scale_ub=1200.0, 2025-05-07T20:33:10.1976580Z contiguous=True, 2025-05-07T20:33:10.1976659Z compiled=False, 2025-05-07T20:33:10.1976735Z ) 2025-05-07T20:33:10.1976945Z self = 2025-05-07T20:33:10.1977114Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.1977119Z 2025-05-07T20:33:10.1977195Z @given( 2025-05-07T20:33:10.1977306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1977399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1977515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1977625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1977740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1977808Z ) 2025-05-07T20:33:10.1978088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1978185Z def test_silu_mul_quant( 2025-05-07T20:33:10.1978260Z self, 2025-05-07T20:33:10.1978332Z T: int, 2025-05-07T20:33:10.1978410Z D: int, 2025-05-07T20:33:10.1978505Z scale_ub: Optional[float], 2025-05-07T20:33:10.1978592Z contiguous: bool, 2025-05-07T20:33:10.1978681Z compiled: bool, 2025-05-07T20:33:10.1978756Z ) -> None: 2025-05-07T20:33:10.1978845Z torch.manual_seed(2025) 2025-05-07T20:33:10.1978919Z 2025-05-07T20:33:10.1979082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1979157Z 2025-05-07T20:33:10.1979243Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1979362Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1979453Z x = x_sign * x_clamp 2025-05-07T20:33:10.1979529Z x0 = x[:, :D] 2025-05-07T20:33:10.1979605Z x1 = x[:, D:] 2025-05-07T20:33:10.1979681Z 2025-05-07T20:33:10.1979764Z if contiguous: 2025-05-07T20:33:10.1979855Z x0 = x0.contiguous() 2025-05-07T20:33:10.1979992Z x1 = x1.contiguous() 2025-05-07T20:33:10.1980063Z 2025-05-07T20:33:10.1980149Z if scale_ub is not None: 2025-05-07T20:33:10.1980256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1980429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1980506Z ) 2025-05-07T20:33:10.1980577Z else: 2025-05-07T20:33:10.1980669Z scale_ub_tensor = None 2025-05-07T20:33:10.1980745Z 2025-05-07T20:33:10.1980871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1980955Z op = silu_mul_quant 2025-05-07T20:33:10.1981043Z if compiled: 2025-05-07T20:33:10.1981139Z op = torch.compile(op) 2025-05-07T20:33:10.1981239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1981318Z 2025-05-07T20:33:10.1981404Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1981412Z 2025-05-07T20:33:10.1981503Z moe/activation_test.py:117: 2025-05-07T20:33:10.1981637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1981734Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1981878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1982378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1982475Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1982835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1983053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1983394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1983483Z kernel = self.compile( 2025-05-07T20:33:10.1983869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1984047Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1984172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1984182Z 2025-05-07T20:33:10.1984390Z self = 2025-05-07T20:33:10.1985173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1985673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0662a0>} 2025-05-07T20:33:10.1986465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1986656Z context = 2025-05-07T20:33:10.1986663Z 2025-05-07T20:33:10.1986832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1987096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1987199Z module_map=module_map) 2025-05-07T20:33:10.1987363Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1987460Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.1987534Z E ^ 2025-05-07T20:33:10.1987890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.1987894Z 2025-05-07T20:33:10.1988305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.1988350Z 2025-05-07T20:33:10.1988456Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.1988676Z self=, 2025-05-07T20:33:10.1988750Z T=128, 2025-05-07T20:33:10.1988873Z D=5120, 2025-05-07T20:33:10.1988950Z scale_ub=None, 2025-05-07T20:33:10.1989030Z contiguous=True, 2025-05-07T20:33:10.1989117Z compiled=False, 2025-05-07T20:33:10.1989185Z ) 2025-05-07T20:33:10.1989400Z self = 2025-05-07T20:33:10.1989575Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.1989579Z 2025-05-07T20:33:10.1989653Z @given( 2025-05-07T20:33:10.1989772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.1989866Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.1989980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.1990107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.1990218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.1990288Z ) 2025-05-07T20:33:10.1990535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.1990668Z def test_silu_mul_quant( 2025-05-07T20:33:10.1990748Z self, 2025-05-07T20:33:10.1990821Z T: int, 2025-05-07T20:33:10.1990895Z D: int, 2025-05-07T20:33:10.1990995Z scale_ub: Optional[float], 2025-05-07T20:33:10.1991082Z contiguous: bool, 2025-05-07T20:33:10.1991163Z compiled: bool, 2025-05-07T20:33:10.1991244Z ) -> None: 2025-05-07T20:33:10.1991336Z torch.manual_seed(2025) 2025-05-07T20:33:10.1991405Z 2025-05-07T20:33:10.1991577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.1991647Z 2025-05-07T20:33:10.1991737Z x_sign = torch.sign(x) 2025-05-07T20:33:10.1991869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.1991953Z x = x_sign * x_clamp 2025-05-07T20:33:10.1992038Z x0 = x[:, :D] 2025-05-07T20:33:10.1992113Z x1 = x[:, D:] 2025-05-07T20:33:10.1992184Z 2025-05-07T20:33:10.1992271Z if contiguous: 2025-05-07T20:33:10.1992358Z x0 = x0.contiguous() 2025-05-07T20:33:10.1992444Z x1 = x1.contiguous() 2025-05-07T20:33:10.1992520Z 2025-05-07T20:33:10.1992606Z if scale_ub is not None: 2025-05-07T20:33:10.1992707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.1992848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.1992918Z ) 2025-05-07T20:33:10.1992991Z else: 2025-05-07T20:33:10.1993088Z scale_ub_tensor = None 2025-05-07T20:33:10.1993157Z 2025-05-07T20:33:10.1993282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.1993421Z op = silu_mul_quant 2025-05-07T20:33:10.1993506Z if compiled: 2025-05-07T20:33:10.1993610Z op = torch.compile(op) 2025-05-07T20:33:10.1993712Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1993782Z 2025-05-07T20:33:10.1993876Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.1993883Z 2025-05-07T20:33:10.1993976Z moe/activation_test.py:117: 2025-05-07T20:33:10.1994102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1994207Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.1994304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.1994797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.1994898Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.1995256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.1995553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.1995890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.1995983Z kernel = self.compile( 2025-05-07T20:33:10.1996408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.1996581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.1996713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.1996717Z 2025-05-07T20:33:10.1996918Z self = 2025-05-07T20:33:10.1997694Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.1998200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15af0671a0>} 2025-05-07T20:33:10.1998979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.1999176Z context = 2025-05-07T20:33:10.1999181Z 2025-05-07T20:33:10.1999341Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.1999599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.1999711Z module_map=module_map) 2025-05-07T20:33:10.1999868Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.1999970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2000043Z E ^ 2025-05-07T20:33:10.2000396Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2000402Z 2025-05-07T20:33:10.2000819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2000825Z 2025-05-07T20:33:10.2000925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2001152Z self=, 2025-05-07T20:33:10.2001225Z T=128, 2025-05-07T20:33:10.2001302Z D=7168, 2025-05-07T20:33:10.2001392Z scale_ub=None, 2025-05-07T20:33:10.2001473Z contiguous=True, 2025-05-07T20:33:10.2001553Z compiled=False, 2025-05-07T20:33:10.2001632Z ) 2025-05-07T20:33:10.2001845Z self = 2025-05-07T20:33:10.2002057Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2002063Z 2025-05-07T20:33:10.2002147Z @given( 2025-05-07T20:33:10.2002261Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2002366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2002479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2002594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2002713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2002783Z ) 2025-05-07T20:33:10.2003026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2003124Z def test_silu_mul_quant( 2025-05-07T20:33:10.2003198Z self, 2025-05-07T20:33:10.2003269Z T: int, 2025-05-07T20:33:10.2003354Z D: int, 2025-05-07T20:33:10.2003451Z scale_ub: Optional[float], 2025-05-07T20:33:10.2003536Z contiguous: bool, 2025-05-07T20:33:10.2003803Z compiled: bool, 2025-05-07T20:33:10.2003882Z ) -> None: 2025-05-07T20:33:10.2004025Z torch.manual_seed(2025) 2025-05-07T20:33:10.2004096Z 2025-05-07T20:33:10.2004413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2004490Z 2025-05-07T20:33:10.2004615Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2004736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2004825Z x = x_sign * x_clamp 2025-05-07T20:33:10.2004895Z x0 = x[:, :D] 2025-05-07T20:33:10.2004967Z x1 = x[:, D:] 2025-05-07T20:33:10.2005038Z 2025-05-07T20:33:10.2005116Z if contiguous: 2025-05-07T20:33:10.2005204Z x0 = x0.contiguous() 2025-05-07T20:33:10.2005294Z x1 = x1.contiguous() 2025-05-07T20:33:10.2005359Z 2025-05-07T20:33:10.2005451Z if scale_ub is not None: 2025-05-07T20:33:10.2005550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2005683Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2005761Z ) 2025-05-07T20:33:10.2005830Z else: 2025-05-07T20:33:10.2005918Z scale_ub_tensor = None 2025-05-07T20:33:10.2005993Z 2025-05-07T20:33:10.2006115Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2006243Z op = silu_mul_quant 2025-05-07T20:33:10.2006328Z if compiled: 2025-05-07T20:33:10.2006419Z op = torch.compile(op) 2025-05-07T20:33:10.2006518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2006590Z 2025-05-07T20:33:10.2006674Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2006678Z 2025-05-07T20:33:10.2006772Z moe/activation_test.py:117: 2025-05-07T20:33:10.2006893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2006986Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2007088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2007584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2007673Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2008031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2008626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2009050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2009138Z kernel = self.compile( 2025-05-07T20:33:10.2009515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2009689Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2009810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2009987Z 2025-05-07T20:33:10.2010200Z self = 2025-05-07T20:33:10.2010970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2011471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb0040>} 2025-05-07T20:33:10.2012212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2012396Z context = 2025-05-07T20:33:10.2012401Z 2025-05-07T20:33:10.2012568Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2012886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2012990Z module_map=module_map) 2025-05-07T20:33:10.2013157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2013307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2013384Z E ^ 2025-05-07T20:33:10.2013733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2013738Z 2025-05-07T20:33:10.2014142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2014147Z 2025-05-07T20:33:10.2014247Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2014463Z self=, 2025-05-07T20:33:10.2014529Z T=2048, 2025-05-07T20:33:10.2014602Z D=7168, 2025-05-07T20:33:10.2014678Z scale_ub=1200.0, 2025-05-07T20:33:10.2014762Z contiguous=True, 2025-05-07T20:33:10.2014837Z compiled=False, 2025-05-07T20:33:10.2014902Z ) 2025-05-07T20:33:10.2015121Z self = 2025-05-07T20:33:10.2015364Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2015368Z 2025-05-07T20:33:10.2015435Z @given( 2025-05-07T20:33:10.2015553Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2015645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2015751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2015867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2015973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2016049Z ) 2025-05-07T20:33:10.2016289Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2016374Z def test_silu_mul_quant( 2025-05-07T20:33:10.2016450Z self, 2025-05-07T20:33:10.2016517Z T: int, 2025-05-07T20:33:10.2021055Z D: int, 2025-05-07T20:33:10.2021176Z scale_ub: Optional[float], 2025-05-07T20:33:10.2021285Z contiguous: bool, 2025-05-07T20:33:10.2021374Z compiled: bool, 2025-05-07T20:33:10.2021456Z ) -> None: 2025-05-07T20:33:10.2021559Z torch.manual_seed(2025) 2025-05-07T20:33:10.2021636Z 2025-05-07T20:33:10.2021810Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2023679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2023686Z 2025-05-07T20:33:10.2023807Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2023816Z 2025-05-07T20:33:10.2023926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2024150Z self=, 2025-05-07T20:33:10.2024227Z T=1, 2025-05-07T20:33:10.2024307Z D=5120, 2025-05-07T20:33:10.2024390Z scale_ub=1200.0, 2025-05-07T20:33:10.2024478Z contiguous=True, 2025-05-07T20:33:10.2024560Z compiled=False, 2025-05-07T20:33:10.2024628Z ) 2025-05-07T20:33:10.2024850Z self = 2025-05-07T20:33:10.2025014Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2025019Z 2025-05-07T20:33:10.2025098Z @given( 2025-05-07T20:33:10.2025222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2025362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2025481Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2025602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2025754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2025834Z ) 2025-05-07T20:33:10.2026075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2026165Z def test_silu_mul_quant( 2025-05-07T20:33:10.2026246Z self, 2025-05-07T20:33:10.2026318Z T: int, 2025-05-07T20:33:10.2026391Z D: int, 2025-05-07T20:33:10.2026497Z scale_ub: Optional[float], 2025-05-07T20:33:10.2026584Z contiguous: bool, 2025-05-07T20:33:10.2026666Z compiled: bool, 2025-05-07T20:33:10.2026752Z ) -> None: 2025-05-07T20:33:10.2026846Z torch.manual_seed(2025) 2025-05-07T20:33:10.2026915Z 2025-05-07T20:33:10.2027093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2027166Z 2025-05-07T20:33:10.2027262Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2027383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2027513Z x = x_sign * x_clamp 2025-05-07T20:33:10.2027599Z x0 = x[:, :D] 2025-05-07T20:33:10.2027674Z x1 = x[:, D:] 2025-05-07T20:33:10.2027743Z 2025-05-07T20:33:10.2027831Z if contiguous: 2025-05-07T20:33:10.2027920Z x0 = x0.contiguous() 2025-05-07T20:33:10.2028005Z x1 = x1.contiguous() 2025-05-07T20:33:10.2028078Z 2025-05-07T20:33:10.2028164Z if scale_ub is not None: 2025-05-07T20:33:10.2028264Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2028409Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2028482Z ) 2025-05-07T20:33:10.2028561Z else: 2025-05-07T20:33:10.2028661Z scale_ub_tensor = None 2025-05-07T20:33:10.2028734Z 2025-05-07T20:33:10.2028872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2028958Z op = silu_mul_quant 2025-05-07T20:33:10.2029042Z if compiled: 2025-05-07T20:33:10.2029151Z op = torch.compile(op) 2025-05-07T20:33:10.2029255Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2029322Z 2025-05-07T20:33:10.2029418Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2029422Z 2025-05-07T20:33:10.2029516Z moe/activation_test.py:117: 2025-05-07T20:33:10.2029643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2029747Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2029843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2030419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2030513Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2030871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2031094Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2031432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2031518Z kernel = self.compile( 2025-05-07T20:33:10.2031898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2032069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2032201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2032205Z 2025-05-07T20:33:10.2032402Z self = 2025-05-07T20:33:10.2033222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2033733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aeeb1580>} 2025-05-07T20:33:10.2034512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2034706Z context = 2025-05-07T20:33:10.2034710Z 2025-05-07T20:33:10.2034870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2035135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2035240Z module_map=module_map) 2025-05-07T20:33:10.2035397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2035493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2035606Z E ^ 2025-05-07T20:33:10.2035958Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2035963Z 2025-05-07T20:33:10.2036378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2036382Z 2025-05-07T20:33:10.2036480Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2036702Z self=, 2025-05-07T20:33:10.2036770Z T=2048, 2025-05-07T20:33:10.2036839Z D=5120, 2025-05-07T20:33:10.2036922Z scale_ub=None, 2025-05-07T20:33:10.2037002Z contiguous=True, 2025-05-07T20:33:10.2037081Z compiled=False, 2025-05-07T20:33:10.2037152Z ) 2025-05-07T20:33:10.2037369Z self = 2025-05-07T20:33:10.2037536Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2037552Z 2025-05-07T20:33:10.2037622Z @given( 2025-05-07T20:33:10.2037734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2037836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2037945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2038055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2038170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2038236Z ) 2025-05-07T20:33:10.2038477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2038571Z def test_silu_mul_quant( 2025-05-07T20:33:10.2038642Z self, 2025-05-07T20:33:10.2038764Z T: int, 2025-05-07T20:33:10.2038841Z D: int, 2025-05-07T20:33:10.2038935Z scale_ub: Optional[float], 2025-05-07T20:33:10.2039025Z contiguous: bool, 2025-05-07T20:33:10.2039105Z compiled: bool, 2025-05-07T20:33:10.2039176Z ) -> None: 2025-05-07T20:33:10.2039273Z torch.manual_seed(2025) 2025-05-07T20:33:10.2039340Z 2025-05-07T20:33:10.2039502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2039576Z 2025-05-07T20:33:10.2039661Z > x_sign = torch.sign(x) 2025-05-07T20:33:10.2041490Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
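The CompilationError above is raised while Triton lowers the kernel, before anything executes: the fp8e4nv (e4m3) element type is unavailable on this runner's A10G GPU (compute capability 8.6), which is why only fp8e4b15 and fp8e5 are offered. A minimal sketch of a capability guard that would skip these cases instead of failing them; the >= (8, 9) threshold reflects Triton's CUDA fp8e4nv requirement as I understand it, and the marker name is illustrative, not part of the FBGEMM test suite:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's CUDA fp8e4nv needs compute capability >= 8.9 (Ada/Hopper);
        # the A10G on g5 instances reports (8, 6), so this is False there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv unsupported on this GPU architecture",
    )

Applied as @requires_fp8 on test_silu_mul_quant, these examples would be reported as skips rather than CompilationErrors.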
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2041496Z 2025-05-07T20:33:10.2041611Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:10.2041618Z 2025-05-07T20:33:10.2041714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2041976Z self=, 2025-05-07T20:33:10.2042045Z T=16384, 2025-05-07T20:33:10.2042120Z D=5120, 2025-05-07T20:33:10.2042195Z scale_ub=None, 2025-05-07T20:33:10.2042271Z contiguous=True, 2025-05-07T20:33:10.2042350Z compiled=False, 2025-05-07T20:33:10.2042422Z ) 2025-05-07T20:33:10.2042632Z self = 2025-05-07T20:33:10.2042802Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2042806Z 2025-05-07T20:33:10.2042884Z @given( 2025-05-07T20:33:10.2042997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2043098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2043207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2043316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2043476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2043542Z ) 2025-05-07T20:33:10.2043785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2043876Z def test_silu_mul_quant( 2025-05-07T20:33:10.2043948Z self, 2025-05-07T20:33:10.2044016Z T: int, 2025-05-07T20:33:10.2044090Z D: int, 2025-05-07T20:33:10.2044181Z scale_ub: Optional[float], 2025-05-07T20:33:10.2044416Z contiguous: bool, 2025-05-07T20:33:10.2044506Z compiled: bool, 2025-05-07T20:33:10.2044574Z ) -> None: 2025-05-07T20:33:10.2044673Z torch.manual_seed(2025) 2025-05-07T20:33:10.2044740Z 2025-05-07T20:33:10.2044907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2046682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2046692Z 2025-05-07T20:33:10.2046803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2046808Z 2025-05-07T20:33:10.2046910Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2047124Z self=, 2025-05-07T20:33:10.2047238Z T=4096, 2025-05-07T20:33:10.2047316Z D=5120, 2025-05-07T20:33:10.2047391Z scale_ub=None, 2025-05-07T20:33:10.2047468Z contiguous=True, 2025-05-07T20:33:10.2047551Z compiled=False, 2025-05-07T20:33:10.2047617Z ) 2025-05-07T20:33:10.2047836Z self = 2025-05-07T20:33:10.2048006Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2048011Z 2025-05-07T20:33:10.2048078Z @given( 2025-05-07T20:33:10.2048207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2048298Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2048405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2048521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2048631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2048702Z ) 2025-05-07T20:33:10.2048943Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2049074Z def test_silu_mul_quant( 2025-05-07T20:33:10.2049154Z self, 2025-05-07T20:33:10.2049223Z T: int, 2025-05-07T20:33:10.2049293Z D: int, 2025-05-07T20:33:10.2049390Z scale_ub: Optional[float], 2025-05-07T20:33:10.2049519Z contiguous: bool, 2025-05-07T20:33:10.2049596Z compiled: bool, 2025-05-07T20:33:10.2049672Z ) -> None: 2025-05-07T20:33:10.2049759Z torch.manual_seed(2025) 2025-05-07T20:33:10.2049822Z 2025-05-07T20:33:10.2049988Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2051749Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2051754Z 2025-05-07T20:33:10.2051912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2051919Z 2025-05-07T20:33:10.2052013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2052238Z self=, 2025-05-07T20:33:10.2052305Z T=2048, 2025-05-07T20:33:10.2052372Z D=5120, 2025-05-07T20:33:10.2052451Z scale_ub=None, 2025-05-07T20:33:10.2052528Z contiguous=False, 2025-05-07T20:33:10.2052602Z compiled=False, 2025-05-07T20:33:10.2052674Z ) 2025-05-07T20:33:10.2052881Z self = 2025-05-07T20:33:10.2053048Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.2053054Z 2025-05-07T20:33:10.2053131Z @given( 2025-05-07T20:33:10.2053243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2053339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2053447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2053558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2053676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2053740Z ) 2025-05-07T20:33:10.2053979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2054069Z def test_silu_mul_quant( 2025-05-07T20:33:10.2054136Z self, 2025-05-07T20:33:10.2054205Z T: int, 2025-05-07T20:33:10.2054282Z D: int, 2025-05-07T20:33:10.2054371Z scale_ub: Optional[float], 2025-05-07T20:33:10.2054451Z contiguous: bool, 2025-05-07T20:33:10.2054533Z compiled: bool, 2025-05-07T20:33:10.2054602Z ) -> None: 2025-05-07T20:33:10.2054765Z torch.manual_seed(2025) 2025-05-07T20:33:10.2054830Z 2025-05-07T20:33:10.2054993Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2056747Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
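The session header later in this log shows these examples running under a hypothesis profile named 'ci' (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). A sketch of how such a profile is registered and loaded, assuming it lives in a conftest.py; the values mirror the session header, nothing else is taken from the FBGEMM sources:

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,
        derandomize=True,   # identical example order on every run
        deadline=None,      # no per-example time budget
        print_blob=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")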
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2056757Z 2025-05-07T20:33:10.2056865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2056870Z 2025-05-07T20:33:10.2056968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2057184Z self=, 2025-05-07T20:33:10.2057293Z T=4096, 2025-05-07T20:33:10.2057369Z D=7168, 2025-05-07T20:33:10.2057441Z scale_ub=None, 2025-05-07T20:33:10.2057517Z contiguous=True, 2025-05-07T20:33:10.2057601Z compiled=True, 2025-05-07T20:33:10.2057703Z ) 2025-05-07T20:33:10.2057920Z self = 2025-05-07T20:33:10.2058081Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.2058086Z 2025-05-07T20:33:10.2058152Z @given( 2025-05-07T20:33:10.2058268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2058359Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2058469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2058583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2058694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2058769Z ) 2025-05-07T20:33:10.2059010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2059095Z def test_silu_mul_quant( 2025-05-07T20:33:10.2059172Z self, 2025-05-07T20:33:10.2059239Z T: int, 2025-05-07T20:33:10.2059380Z D: int, 2025-05-07T20:33:10.2059480Z scale_ub: Optional[float], 2025-05-07T20:33:10.2059561Z contiguous: bool, 2025-05-07T20:33:10.2059640Z compiled: bool, 2025-05-07T20:33:10.2059719Z ) -> None: 2025-05-07T20:33:10.2059804Z torch.manual_seed(2025) 2025-05-07T20:33:10.2059867Z 2025-05-07T20:33:10.2060036Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2061798Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2061808Z 2025-05-07T20:33:10.2061929Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2061933Z 2025-05-07T20:33:10.2062027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2062251Z self=, 2025-05-07T20:33:10.2062319Z T=2048, 2025-05-07T20:33:10.2062386Z D=5120, 2025-05-07T20:33:10.2062467Z scale_ub=1200.0, 2025-05-07T20:33:10.2062545Z contiguous=False, 2025-05-07T20:33:10.2062621Z compiled=False, 2025-05-07T20:33:10.2062691Z ) 2025-05-07T20:33:10.2062900Z self = 2025-05-07T20:33:10.2063122Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2063127Z 2025-05-07T20:33:10.2063204Z @given( 2025-05-07T20:33:10.2063314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2063412Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2063522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2063633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2063746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2063811Z ) 2025-05-07T20:33:10.2064049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2064144Z def test_silu_mul_quant( 2025-05-07T20:33:10.2064210Z self, 2025-05-07T20:33:10.2064276Z T: int, 2025-05-07T20:33:10.2064348Z D: int, 2025-05-07T20:33:10.2064437Z scale_ub: Optional[float], 2025-05-07T20:33:10.2064518Z contiguous: bool, 2025-05-07T20:33:10.2064604Z compiled: bool, 2025-05-07T20:33:10.2064672Z ) -> None: 2025-05-07T20:33:10.2064808Z torch.manual_seed(2025) 2025-05-07T20:33:10.2064873Z 2025-05-07T20:33:10.2065033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2066797Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
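Since every parameter here is drawn with sampled_from, any failing combination in this log can be pinned as a permanent regression case with hypothesis's @example decorator, which is checked in addition to the drawn examples. A sketch using one combination from above, shown as a free function for brevity (the real test is a method; the body is elided):

    from typing import Optional
    from hypothesis import example, given, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...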
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2066842Z 2025-05-07T20:33:10.2066952Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2066956Z 2025-05-07T20:33:10.2067059Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2067276Z self=, 2025-05-07T20:33:10.2067344Z T=4096, 2025-05-07T20:33:10.2067420Z D=7168, 2025-05-07T20:33:10.2067493Z scale_ub=1200.0, 2025-05-07T20:33:10.2067612Z contiguous=True, 2025-05-07T20:33:10.2067695Z compiled=False, 2025-05-07T20:33:10.2067759Z ) 2025-05-07T20:33:10.2067974Z self = 2025-05-07T20:33:10.2068139Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2068144Z 2025-05-07T20:33:10.2068213Z @given( 2025-05-07T20:33:10.2068329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2068420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2068531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2068647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2068753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2068823Z ) 2025-05-07T20:33:10.2069064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2069150Z def test_silu_mul_quant( 2025-05-07T20:33:10.2069228Z self, 2025-05-07T20:33:10.2069300Z T: int, 2025-05-07T20:33:10.2069368Z D: int, 2025-05-07T20:33:10.2069463Z scale_ub: Optional[float], 2025-05-07T20:33:10.2069542Z contiguous: bool, 2025-05-07T20:33:10.2069619Z compiled: bool, 2025-05-07T20:33:10.2069693Z ) -> None: 2025-05-07T20:33:10.2069777Z torch.manual_seed(2025) 2025-05-07T20:33:10.2069839Z 2025-05-07T20:33:10.2070007Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2071809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2071820Z 2025-05-07T20:33:10.2071937Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2071942Z 2025-05-07T20:33:10.2072036Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2072258Z self=, 2025-05-07T20:33:10.2072326Z T=16384, 2025-05-07T20:33:10.2072393Z D=7168, 2025-05-07T20:33:10.2072472Z scale_ub=None, 2025-05-07T20:33:10.2072549Z contiguous=False, 2025-05-07T20:33:10.2072624Z compiled=True, 2025-05-07T20:33:10.2072695Z ) 2025-05-07T20:33:10.2072906Z self = 2025-05-07T20:33:10.2073117Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.2073122Z 2025-05-07T20:33:10.2073200Z @given( 2025-05-07T20:33:10.2073309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2073546Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2073655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2073764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2073879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2073945Z ) 2025-05-07T20:33:10.2074184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2074275Z def test_silu_mul_quant( 2025-05-07T20:33:10.2074343Z self, 2025-05-07T20:33:10.2074411Z T: int, 2025-05-07T20:33:10.2074485Z D: int, 2025-05-07T20:33:10.2074576Z scale_ub: Optional[float], 2025-05-07T20:33:10.2074666Z contiguous: bool, 2025-05-07T20:33:10.2074744Z compiled: bool, 2025-05-07T20:33:10.2074817Z ) -> None: 2025-05-07T20:33:10.2074908Z torch.manual_seed(2025) 2025-05-07T20:33:10.2074972Z 2025-05-07T20:33:10.2075132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2076948Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
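The "Tried to allocate" figures correspond exactly to the input tensor x of shape [T, 2*D] in bfloat16 (2 bytes per element); a quick check of the arithmetic for the cases in this log:

    def x_mib(T: int, D: int) -> float:
        # torch.randn([T, 2 * D], dtype=torch.bfloat16): T * 2D elements, 2 bytes each
        return T * 2 * D * 2 / 2**20

    assert x_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert x_mib(16384, 5120) == 320.0  # "320.00 MiB"
    assert x_mib(4096, 7168) == 112.0   # "112.00 MiB"
    assert x_mib(2048, 5120) == 40.0    # "40.00 MiB"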
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2076954Z 2025-05-07T20:33:10.2077068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2077072Z 2025-05-07T20:33:10.2077174Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2077391Z self=, 2025-05-07T20:33:10.2077460Z T=4096, 2025-05-07T20:33:10.2077544Z D=7168, 2025-05-07T20:33:10.2077618Z scale_ub=None, 2025-05-07T20:33:10.2077694Z contiguous=True, 2025-05-07T20:33:10.2077777Z compiled=False, 2025-05-07T20:33:10.2077843Z ) 2025-05-07T20:33:10.2078060Z self = 2025-05-07T20:33:10.2078227Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2078231Z 2025-05-07T20:33:10.2078299Z @given( 2025-05-07T20:33:10.2078422Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2078511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2078618Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2078786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2078894Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2078967Z ) 2025-05-07T20:33:10.2079204Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2079292Z def test_silu_mul_quant( 2025-05-07T20:33:10.2079368Z self, 2025-05-07T20:33:10.2079437Z T: int, 2025-05-07T20:33:10.2079504Z D: int, 2025-05-07T20:33:10.2079600Z scale_ub: Optional[float], 2025-05-07T20:33:10.2079681Z contiguous: bool, 2025-05-07T20:33:10.2079757Z compiled: bool, 2025-05-07T20:33:10.2079833Z ) -> None: 2025-05-07T20:33:10.2079920Z torch.manual_seed(2025) 2025-05-07T20:33:10.2079983Z 2025-05-07T20:33:10.2080150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2081953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2082000Z 2025-05-07T20:33:10.2082116Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2082120Z 2025-05-07T20:33:10.2082213Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2082451Z self=, 2025-05-07T20:33:10.2082526Z T=16384, 2025-05-07T20:33:10.2082595Z D=7168, 2025-05-07T20:33:10.2082668Z scale_ub=None, 2025-05-07T20:33:10.2082749Z contiguous=True, 2025-05-07T20:33:10.2082827Z compiled=False, 2025-05-07T20:33:10.2082895Z ) 2025-05-07T20:33:10.2083115Z self = 2025-05-07T20:33:10.2083285Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.2083289Z 2025-05-07T20:33:10.2083404Z @given( 2025-05-07T20:33:10.2083524Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2083615Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2083730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2083840Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2083947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2084022Z ) 2025-05-07T20:33:10.2084368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2084464Z def test_silu_mul_quant( 2025-05-07T20:33:10.2084531Z self, 2025-05-07T20:33:10.2084599Z T: int, 2025-05-07T20:33:10.2084683Z D: int, 2025-05-07T20:33:10.2084773Z scale_ub: Optional[float], 2025-05-07T20:33:10.2084855Z contiguous: bool, 2025-05-07T20:33:10.2084941Z compiled: bool, 2025-05-07T20:33:10.2085013Z ) -> None: 2025-05-07T20:33:10.2085098Z torch.manual_seed(2025) 2025-05-07T20:33:10.2085176Z 2025-05-07T20:33:10.2085340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2087152Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2087158Z 2025-05-07T20:33:10.2087275Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2087279Z 2025-05-07T20:33:10.2087373Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2087598Z self=, 2025-05-07T20:33:10.2087672Z T=16384, 2025-05-07T20:33:10.2087744Z D=7168, 2025-05-07T20:33:10.2087820Z scale_ub=1200.0, 2025-05-07T20:33:10.2087898Z contiguous=True, 2025-05-07T20:33:10.2087983Z compiled=False, 2025-05-07T20:33:10.2088046Z ) 2025-05-07T20:33:10.2088254Z self = 2025-05-07T20:33:10.2088431Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2088435Z 2025-05-07T20:33:10.2088503Z @given( 2025-05-07T20:33:10.2088615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2088712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2088821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2088980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2089085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2089150Z ) 2025-05-07T20:33:10.2089398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2089546Z def test_silu_mul_quant( 2025-05-07T20:33:10.2089613Z self, 2025-05-07T20:33:10.2089687Z T: int, 2025-05-07T20:33:10.2089754Z D: int, 2025-05-07T20:33:10.2089844Z scale_ub: Optional[float], 2025-05-07T20:33:10.2089931Z contiguous: bool, 2025-05-07T20:33:10.2090008Z compiled: bool, 2025-05-07T20:33:10.2090075Z ) -> None: 2025-05-07T20:33:10.2090168Z torch.manual_seed(2025) 2025-05-07T20:33:10.2090231Z 2025-05-07T20:33:10.2090398Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2092162Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2092212Z 2025-05-07T20:33:10.2092331Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2092336Z 2025-05-07T20:33:10.2092430Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2092645Z self=, 2025-05-07T20:33:10.2092718Z T=128, 2025-05-07T20:33:10.2092787Z D=5120, 2025-05-07T20:33:10.2092861Z scale_ub=1200.0, 2025-05-07T20:33:10.2092946Z contiguous=False, 2025-05-07T20:33:10.2093020Z compiled=False, 2025-05-07T20:33:10.2093086Z ) 2025-05-07T20:33:10.2093303Z self = 2025-05-07T20:33:10.2093469Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2093479Z 2025-05-07T20:33:10.2093551Z @given( 2025-05-07T20:33:10.2093660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2093750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2093864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2093974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2094077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2094148Z ) 2025-05-07T20:33:10.2094385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2094476Z def test_silu_mul_quant( 2025-05-07T20:33:10.2094586Z self, 2025-05-07T20:33:10.2094656Z T: int, 2025-05-07T20:33:10.2094733Z D: int, 2025-05-07T20:33:10.2094824Z scale_ub: Optional[float], 2025-05-07T20:33:10.2094905Z contiguous: bool, 2025-05-07T20:33:10.2094988Z compiled: bool, 2025-05-07T20:33:10.2095059Z ) -> None: 2025-05-07T20:33:10.2095145Z torch.manual_seed(2025) 2025-05-07T20:33:10.2095215Z 2025-05-07T20:33:10.2095375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2095442Z 2025-05-07T20:33:10.2095534Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2095651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2095733Z x = x_sign * x_clamp 2025-05-07T20:33:10.2095812Z x0 = x[:, :D] 2025-05-07T20:33:10.2095885Z x1 = x[:, D:] 2025-05-07T20:33:10.2095957Z 2025-05-07T20:33:10.2096036Z if contiguous: 2025-05-07T20:33:10.2096121Z x0 = x0.contiguous() 2025-05-07T20:33:10.2096212Z x1 = x1.contiguous() 2025-05-07T20:33:10.2096279Z 2025-05-07T20:33:10.2096412Z if scale_ub is not None: 2025-05-07T20:33:10.2096524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2096656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2096726Z ) 2025-05-07T20:33:10.2096842Z else: 2025-05-07T20:33:10.2096929Z scale_ub_tensor = None 2025-05-07T20:33:10.2096992Z 2025-05-07T20:33:10.2097123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2097206Z op = silu_mul_quant 2025-05-07T20:33:10.2097289Z if compiled: 2025-05-07T20:33:10.2097381Z op = torch.compile(op) 2025-05-07T20:33:10.2097479Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2097552Z 2025-05-07T20:33:10.2097633Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2097637Z 2025-05-07T20:33:10.2097728Z moe/activation_test.py:117: 2025-05-07T20:33:10.2097860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2097955Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2098047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2098548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2098682Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2099042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2099259Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2099593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2099687Z kernel = self.compile( 2025-05-07T20:33:10.2100066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2100244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2100367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2100371Z 2025-05-07T20:33:10.2100573Z self = 2025-05-07T20:33:10.2101355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2101850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aefe91c0>} 2025-05-07T20:33:10.2102642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2102833Z context = 2025-05-07T20:33:10.2102838Z 2025-05-07T20:33:10.2102994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2103260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2103366Z module_map=module_map) 2025-05-07T20:33:10.2103528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2103617Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2103689Z E ^ 2025-05-07T20:33:10.2104054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2104059Z 2025-05-07T20:33:10.2104464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2104471Z 2025-05-07T20:33:10.2104573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2104857Z self=, 2025-05-07T20:33:10.2104925Z T=2048, 2025-05-07T20:33:10.2104999Z D=7168, 2025-05-07T20:33:10.2105079Z scale_ub=None, 2025-05-07T20:33:10.2105200Z contiguous=False, 2025-05-07T20:33:10.2105282Z compiled=False, 2025-05-07T20:33:10.2105346Z ) 2025-05-07T20:33:10.2105560Z self = 2025-05-07T20:33:10.2105734Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.2105738Z 2025-05-07T20:33:10.2105808Z @given( 2025-05-07T20:33:10.2105918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2106014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2106123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2106240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2106347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2106415Z ) 2025-05-07T20:33:10.2106663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2106748Z def test_silu_mul_quant( 2025-05-07T20:33:10.2106863Z self, 2025-05-07T20:33:10.2106938Z T: int, 2025-05-07T20:33:10.2107007Z D: int, 2025-05-07T20:33:10.2107100Z scale_ub: Optional[float], 2025-05-07T20:33:10.2107185Z contiguous: bool, 2025-05-07T20:33:10.2107262Z compiled: bool, 2025-05-07T20:33:10.2107336Z ) -> None: 2025-05-07T20:33:10.2107421Z torch.manual_seed(2025) 2025-05-07T20:33:10.2107485Z 2025-05-07T20:33:10.2107657Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2109833Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
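The free-memory figure shrinks as the run goes on (26.44 MiB in the earlier failures, 4.44 MiB by the later ones), so each example starts with a slightly fuller pool. A cleanup sketch that could run between examples; note it only returns cached or unreferenced memory, and since the log reports ~21.7 GiB genuinely allocated by PyTorch, the dominant usage here is presumably held by live references created before this test:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # collect unreachable tensors still holding CUDA blocks
        torch.cuda.empty_cache()  # return cached-but-unused blocks to the driver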
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2109849Z 2025-05-07T20:33:10.2109972Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2109977Z 2025-05-07T20:33:10.2110070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2110285Z self=, 2025-05-07T20:33:10.2110359Z T=128, 2025-05-07T20:33:10.2110424Z D=7168, 2025-05-07T20:33:10.2110498Z scale_ub=1200.0, 2025-05-07T20:33:10.2110580Z contiguous=True, 2025-05-07T20:33:10.2110654Z compiled=True, 2025-05-07T20:33:10.2110718Z ) 2025-05-07T20:33:10.2111126Z self = 2025-05-07T20:33:10.2111290Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.2111294Z 2025-05-07T20:33:10.2111371Z @given( 2025-05-07T20:33:10.2111483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2111576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2111691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2111801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2111905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2111979Z ) 2025-05-07T20:33:10.2112217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2112309Z def test_silu_mul_quant( 2025-05-07T20:33:10.2112377Z self, 2025-05-07T20:33:10.2112444Z T: int, 2025-05-07T20:33:10.2112519Z D: int, 2025-05-07T20:33:10.2112611Z scale_ub: Optional[float], 2025-05-07T20:33:10.2112692Z contiguous: bool, 2025-05-07T20:33:10.2112848Z compiled: bool, 2025-05-07T20:33:10.2112920Z ) -> None: 2025-05-07T20:33:10.2113005Z torch.manual_seed(2025) 2025-05-07T20:33:10.2113075Z 2025-05-07T20:33:10.2113239Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2113370Z 2025-05-07T20:33:10.2113459Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2113576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2113663Z x = x_sign * x_clamp 2025-05-07T20:33:10.2113737Z x0 = x[:, :D] 2025-05-07T20:33:10.2113807Z x1 = x[:, D:] 2025-05-07T20:33:10.2113879Z 2025-05-07T20:33:10.2113953Z if contiguous: 2025-05-07T20:33:10.2114037Z x0 = x0.contiguous() 2025-05-07T20:33:10.2114123Z x1 = x1.contiguous() 2025-05-07T20:33:10.2114188Z 2025-05-07T20:33:10.2114271Z if scale_ub is not None: 2025-05-07T20:33:10.2114381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2114512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2114578Z ) 2025-05-07T20:33:10.2114653Z else: 2025-05-07T20:33:10.2114739Z scale_ub_tensor = None 2025-05-07T20:33:10.2114876Z 2025-05-07T20:33:10.2115006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2115086Z op = silu_mul_quant 2025-05-07T20:33:10.2115172Z if compiled: 2025-05-07T20:33:10.2115264Z op = torch.compile(op) 2025-05-07T20:33:10.2115362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2115432Z 2025-05-07T20:33:10.2115515Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2115520Z 2025-05-07T20:33:10.2115607Z moe/activation_test.py:117: 2025-05-07T20:33:10.2115736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2115829Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2115920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2116294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.2116377Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.2116876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2116966Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2117315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:10.2117538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2117870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2117961Z kernel = self.compile( 2025-05-07T20:33:10.2118381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2118552Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2118679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2118687Z 2025-05-07T20:33:10.2118887Z self = 2025-05-07T20:33:10.2119669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2120163Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f15aec6bb00>} 2025-05-07T20:33:10.2120975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2121169Z context = 2025-05-07T20:33:10.2121174Z 2025-05-07T20:33:10.2121333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2121636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2121739Z module_map=module_map) 2025-05-07T20:33:10.2121894Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2121991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2122058Z E ^ 2025-05-07T20:33:10.2122404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2122416Z 2025-05-07T20:33:10.2122824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2122828Z 2025-05-07T20:33:10.2122924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2123146Z self=, 2025-05-07T20:33:10.2123256Z T=128, 2025-05-07T20:33:10.2123324Z D=7168, 2025-05-07T20:33:10.2123403Z scale_ub=1200.0, 2025-05-07T20:33:10.2123479Z contiguous=True, 2025-05-07T20:33:10.2123553Z compiled=False, 2025-05-07T20:33:10.2123624Z ) 2025-05-07T20:33:10.2123834Z self = 2025-05-07T20:33:10.2124002Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.2124006Z 2025-05-07T20:33:10.2124072Z @given( 2025-05-07T20:33:10.2124182Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2124389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2124501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2124610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2124723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2124790Z ) 2025-05-07T20:33:10.2125028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2125122Z def test_silu_mul_quant( 2025-05-07T20:33:10.2125188Z self, 2025-05-07T20:33:10.2125266Z T: int, 2025-05-07T20:33:10.2125332Z D: int, 2025-05-07T20:33:10.2125422Z scale_ub: Optional[float], 2025-05-07T20:33:10.2125507Z contiguous: bool, 2025-05-07T20:33:10.2125584Z compiled: bool, 2025-05-07T20:33:10.2125655Z ) -> None: 2025-05-07T20:33:10.2125748Z torch.manual_seed(2025) 2025-05-07T20:33:10.2125812Z 2025-05-07T20:33:10.2125974Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2126045Z 2025-05-07T20:33:10.2126128Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2126296Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2128075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2128086Z 2025-05-07T20:33:10.2128197Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.2128210Z 2025-05-07T20:33:10.2128302Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2128517Z self=, 2025-05-07T20:33:10.2128595Z T=128, 2025-05-07T20:33:10.2128662Z D=5120, 2025-05-07T20:33:10.2128736Z scale_ub=1200.0, 2025-05-07T20:33:10.2128863Z contiguous=True, 2025-05-07T20:33:10.2128938Z compiled=True, 2025-05-07T20:33:10.2129005Z ) 2025-05-07T20:33:10.2129222Z self = 2025-05-07T20:33:10.2129422Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.2129426Z 2025-05-07T20:33:10.2129492Z @given( 2025-05-07T20:33:10.2129608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2129699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2129812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2129921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2130027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2130099Z ) 2025-05-07T20:33:10.2130341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2130427Z def test_silu_mul_quant( 2025-05-07T20:33:10.2130507Z self, 2025-05-07T20:33:10.2130575Z T: int, 2025-05-07T20:33:10.2130643Z D: int, 2025-05-07T20:33:10.2130739Z scale_ub: Optional[float], 2025-05-07T20:33:10.2130868Z contiguous: bool, 2025-05-07T20:33:10.2130959Z compiled: bool, 2025-05-07T20:33:10.2131028Z ) -> None: 2025-05-07T20:33:10.2131114Z torch.manual_seed(2025) 2025-05-07T20:33:10.2131184Z 2025-05-07T20:33:10.2131343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2131407Z 2025-05-07T20:33:10.2131496Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2131612Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2133372Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2133392Z 2025-05-07T20:33:10.2133503Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:10.2133507Z 2025-05-07T20:33:10.2133601Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2133824Z self=, 2025-05-07T20:33:10.2133891Z T=128, 2025-05-07T20:33:10.2133969Z D=7168, 2025-05-07T20:33:10.2134044Z scale_ub=None, 2025-05-07T20:33:10.2134120Z contiguous=True, 2025-05-07T20:33:10.2134199Z compiled=True, 2025-05-07T20:33:10.2134263Z ) 2025-05-07T20:33:10.2134519Z self = 2025-05-07T20:33:10.2134689Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.2134694Z 2025-05-07T20:33:10.2134762Z @given( 2025-05-07T20:33:10.2134871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2134972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2135085Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2135201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2135307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2135372Z ) 2025-05-07T20:33:10.2135620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2135706Z def test_silu_mul_quant( 2025-05-07T20:33:10.2135775Z self, 2025-05-07T20:33:10.2135850Z T: int, 2025-05-07T20:33:10.2135919Z D: int, 2025-05-07T20:33:10.2136008Z scale_ub: Optional[float], 2025-05-07T20:33:10.2136101Z contiguous: bool, 2025-05-07T20:33:10.2136180Z compiled: bool, 2025-05-07T20:33:10.2136296Z ) -> None: 2025-05-07T20:33:10.2136391Z torch.manual_seed(2025) 2025-05-07T20:33:10.2136457Z 2025-05-07T20:33:10.2136620Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2138419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.2138424Z 2025-05-07T20:33:10.2138545Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.2138676Z =============================== warnings summary =============================== 2025-05-07T20:33:10.2138976Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2139320Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2139616Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:10.2140488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:10.2140712Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:10.2140716Z 2025-05-07T20:33:10.2140920Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:10.2141093Z ================= 1 failed, 1 deselected, 3 warnings in 12.50s ================= 2025-05-07T20:33:12.1103958Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:12.1819368Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:12.1819599Z 2025-05-07T20:33:14.1837186Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:16.4006199Z ============================= test session starts ============================== 2025-05-07T20:33:16.4007397Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:16.4009071Z cachedir: .pytest_cache 2025-05-07T20:33:16.4010200Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:16.4011428Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:16.4011860Z plugins: hypothesis-6.131.14 2025-05-07T20:33:17.9771751Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:18.0734939Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:18.0735341Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:18.0735598Z 2025-05-07T20:33:20.2756135Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.2757465Z self=, 2025-05-07T20:33:20.2758263Z T=1, 2025-05-07T20:33:20.2758608Z D=5120, 2025-05-07T20:33:20.2758976Z scale_ub=None, 2025-05-07T20:33:20.2759417Z contiguous=True, 2025-05-07T20:33:20.2759823Z compiled=True, 2025-05-07T20:33:20.2760853Z ) 2025-05-07T20:33:20.2761500Z self = 2025-05-07T20:33:20.2762457Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:20.2763139Z 2025-05-07T20:33:20.2763294Z @given( 2025-05-07T20:33:20.2763753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.2764563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.2765162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.2765814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.2766455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.2767007Z ) 2025-05-07T20:33:20.2767697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.2768574Z def test_silu_mul_quant( 2025-05-07T20:33:20.2769045Z self, 2025-05-07T20:33:20.2769435Z T: int, 2025-05-07T20:33:20.2769825Z D: int, 2025-05-07T20:33:20.2770260Z scale_ub: Optional[float], 2025-05-07T20:33:20.2770786Z contiguous: bool, 2025-05-07T20:33:20.2771413Z compiled: bool, 2025-05-07T20:33:20.2772411Z ) -> None: 2025-05-07T20:33:20.2772765Z torch.manual_seed(2025) 2025-05-07T20:33:20.2773057Z 2025-05-07T20:33:20.2773324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.2773676Z 2025-05-07T20:33:20.2773874Z x_sign = torch.sign(x) 2025-05-07T20:33:20.2774185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:20.2774504Z x = x_sign * x_clamp 2025-05-07T20:33:20.2774761Z x0 = x[:, :D] 2025-05-07T20:33:20.2774994Z x1 = x[:, D:] 2025-05-07T20:33:20.2775210Z 2025-05-07T20:33:20.2775416Z if contiguous: 2025-05-07T20:33:20.2775670Z x0 = x0.contiguous() 2025-05-07T20:33:20.2775939Z x1 = x1.contiguous() 2025-05-07T20:33:20.2776189Z 2025-05-07T20:33:20.2776395Z if scale_ub is not None: 2025-05-07T20:33:20.2776671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:20.2777018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:20.2777342Z ) 2025-05-07T20:33:20.2777548Z else: 2025-05-07T20:33:20.2777774Z scale_ub_tensor = None 2025-05-07T20:33:20.2778036Z 2025-05-07T20:33:20.2778270Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.2778601Z op = silu_mul_quant 2025-05-07T20:33:20.2778871Z if compiled: 2025-05-07T20:33:20.2779134Z op = torch.compile(op) 2025-05-07T20:33:20.2779430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:20.2779713Z 2025-05-07T20:33:20.2779923Z y_fp8, y_scale = fn() 2025-05-07T20:33:20.2780209Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:20.2780636Z 2025-05-07T20:33:20.2780895Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:20.2781238Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:20.2781542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:20.2781872Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:20.2782243Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:20.2782567Z 2025-05-07T20:33:20.2782783Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:20.2782981Z 2025-05-07T20:33:20.2783102Z moe/activation_test.py:126: 2025-05-07T20:33:20.2783399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.2783755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:20.2784092Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:20.2784889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:20.2785662Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:20.2786278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:20.2786990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:20.2787728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:20.2788469Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:20.2789208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:20.2789857Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:20.2790456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:20.2790994Z fn() 2025-05-07T20:33:20.2791520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:20.2792102Z self.fn.run( 2025-05-07T20:33:20.2792599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:20.2793220Z kernel = self.compile( 2025-05-07T20:33:20.2793760Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:20.2794421Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:20.2794829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:20.2795061Z 2025-05-07T20:33:20.2795285Z self = 2025-05-07T20:33:20.2796376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:20.2797791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c4a502700>} 2025-05-07T20:33:20.2799198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:20.2800236Z context = 2025-05-07T20:33:20.2800529Z 2025-05-07T20:33:20.2800709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:20.2801230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:20.2801698Z module_map=module_map) 2025-05-07T20:33:20.2802118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:20.2802469Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:20.2802734Z E ^ 2025-05-07T20:33:20.2803196Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:20.2803647Z 2025-05-07T20:33:20.2804065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:20.2804712Z 2025-05-07T20:33:20.2804815Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:20.2805236Z self=, 2025-05-07T20:33:20.2805628Z T=2048, 2025-05-07T20:33:20.2805797Z D=5120, 2025-05-07T20:33:20.2805981Z scale_ub=1200.0, 2025-05-07T20:33:20.2806192Z contiguous=True, 2025-05-07T20:33:20.2806402Z compiled=False, 2025-05-07T20:33:20.2806590Z ) 2025-05-07T20:33:20.2806905Z self = 2025-05-07T20:33:20.2807444Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:20.2807716Z 2025-05-07T20:33:20.2807788Z @given( 2025-05-07T20:33:20.2808010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:20.2808641Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:20.2808938Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:20.2809262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:20.2809587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:20.2809854Z ) 2025-05-07T20:33:20.2810190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:20.2810621Z def test_silu_mul_quant( 2025-05-07T20:33:20.2810857Z self, 2025-05-07T20:33:20.2811035Z T: int, 2025-05-07T20:33:20.2811223Z D: int, 2025-05-07T20:33:20.2811437Z scale_ub: Optional[float], 2025-05-07T20:33:20.2811691Z contiguous: bool, 2025-05-07T20:33:20.2811920Z compiled: bool, 2025-05-07T20:33:20.2812133Z ) -> None: 2025-05-07T20:33:20.2812333Z torch.manual_seed(2025) 2025-05-07T20:33:20.2812565Z 2025-05-07T20:33:20.2812838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:20.2813262Z 2025-05-07T20:33:20.2813449Z x_sign = torch.sign(x) 2025-05-07T20:33:20.2813735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:20.2814027Z x = x_sign * x_clamp 2025-05-07T20:33:20.2814259Z x0 = x[:, :D] 
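[For context on what ref_fn asserts: triton_quantize_fp8_row returns a row-quantized FP8 tensor plus one scale per row, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of the assumed row-wise quantization semantics, including the scale_ub cap; this is inferred from the test, not FBGEMM's actual implementation:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        y_scale = (row_max / FP8_MAX).clamp(min=1e-12)  # dequantization scale
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Multiplying y_fp8.to(torch.float32) by y_scale[:, None] then recovers y up to FP8 rounding error.]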
y_scale_ref = ref_fn() 2025-05-07T20:33:22.3703075Z 2025-05-07T20:33:22.3703170Z moe/activation_test.py:126: 2025-05-07T20:33:22.3703465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.3703794Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.3704118Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.3704907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.3705649Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.3706186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.3706863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.3707600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.3708686Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.3709408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.3710041Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.3710634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.3711132Z fn() 2025-05-07T20:33:22.3711633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.3712203Z self.fn.run( 2025-05-07T20:33:22.3712666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.3713176Z kernel = self.compile( 2025-05-07T20:33:22.3713711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.3714363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.3714745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.3714979Z 2025-05-07T20:33:22.3721596Z self = 2025-05-07T20:33:22.3722824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.3724204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c487377e0>} 2025-05-07T20:33:22.3725663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.3726695Z context = 2025-05-07T20:33:22.3726984Z 2025-05-07T20:33:22.3727159Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.3727672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.3728149Z module_map=module_map) 2025-05-07T20:33:22.3728553Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.3728927Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.3729200Z E ^ 2025-05-07T20:33:22.3729745Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.3730195Z 2025-05-07T20:33:22.3730620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6133966Z 2025-05-07T20:33:22.6134268Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6134709Z self=, 2025-05-07T20:33:22.6135182Z T=128, 2025-05-07T20:33:22.6135383Z D=7168, 2025-05-07T20:33:22.6135597Z scale_ub=None, 2025-05-07T20:33:22.6135813Z contiguous=False, 2025-05-07T20:33:22.6136050Z compiled=False, 2025-05-07T20:33:22.6136349Z ) 2025-05-07T20:33:22.6136897Z self = 2025-05-07T20:33:22.6137414Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:22.6137696Z 2025-05-07T20:33:22.6137775Z @given( 2025-05-07T20:33:22.6138008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6138319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6138751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6139086Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6139415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6139703Z ) 2025-05-07T20:33:22.6140059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6140502Z def test_silu_mul_quant( 2025-05-07T20:33:22.6140742Z self, 2025-05-07T20:33:22.6140933Z T: int, 2025-05-07T20:33:22.6141125Z D: int, 2025-05-07T20:33:22.6141335Z scale_ub: Optional[float], 2025-05-07T20:33:22.6141607Z contiguous: bool, 2025-05-07T20:33:22.6141849Z compiled: bool, 2025-05-07T20:33:22.6142071Z ) -> None: 2025-05-07T20:33:22.6142287Z torch.manual_seed(2025) 2025-05-07T20:33:22.6142537Z 2025-05-07T20:33:22.6142806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6143148Z 2025-05-07T20:33:22.6143345Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6143629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6143941Z x = x_sign * x_clamp 2025-05-07T20:33:22.6144180Z x0 = x[:, :D] 2025-05-07T20:33:22.6144389Z x1 = x[:, D:] 2025-05-07T20:33:22.6144595Z 2025-05-07T20:33:22.6144777Z if contiguous: 2025-05-07T20:33:22.6144998Z x0 = x0.contiguous() 2025-05-07T20:33:22.6145256Z x1 = x1.contiguous() 2025-05-07T20:33:22.6145497Z 2025-05-07T20:33:22.6145679Z if scale_ub is not None: 2025-05-07T20:33:22.6145951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6146359Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6146671Z ) 2025-05-07T20:33:22.6146856Z else: 2025-05-07T20:33:22.6147067Z scale_ub_tensor = None 2025-05-07T20:33:22.6147305Z 2025-05-07T20:33:22.6147530Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6147841Z op = silu_mul_quant 2025-05-07T20:33:22.6148083Z if compiled: 2025-05-07T20:33:22.6148328Z op = torch.compile(op) 2025-05-07T20:33:22.6148616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6148876Z 2025-05-07T20:33:22.6149055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.6149220Z 2025-05-07T20:33:22.6149315Z moe/activation_test.py:117: 2025-05-07T20:33:22.6149604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6149925Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.6150198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6150976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.6151647Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.6152174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6152903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6153562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6154075Z kernel = self.compile( 2025-05-07T20:33:22.6154606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6155251Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6155637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6155859Z 2025-05-07T20:33:22.6156065Z self = 2025-05-07T20:33:22.6157134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6158538Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48551440>} 2025-05-07T20:33:22.6159862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6160863Z context = 2025-05-07T20:33:22.6161148Z 2025-05-07T20:33:22.6161310Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6161822Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6162284Z module_map=module_map) 2025-05-07T20:33:22.6162630Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6162976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.6163222Z E ^ 2025-05-07T20:33:22.6163717Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6164166Z 2025-05-07T20:33:22.6164693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6165202Z 2025-05-07T20:33:22.6165298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6165699Z self=, 2025-05-07T20:33:22.6166082Z T=4096, 2025-05-07T20:33:22.6166312Z D=5120, 2025-05-07T20:33:22.6166494Z scale_ub=1200.0, 2025-05-07T20:33:22.6166698Z contiguous=True, 2025-05-07T20:33:22.6166914Z compiled=False, 2025-05-07T20:33:22.6167113Z ) 2025-05-07T20:33:22.6167417Z self = 2025-05-07T20:33:22.6167904Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:22.6168181Z 2025-05-07T20:33:22.6168253Z @given( 2025-05-07T20:33:22.6168475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6168770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6169069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6169389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6169700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6169973Z ) 2025-05-07T20:33:22.6170316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6170739Z def test_silu_mul_quant( 2025-05-07T20:33:22.6170972Z self, 2025-05-07T20:33:22.6171202Z T: int, 2025-05-07T20:33:22.6171384Z D: int, 2025-05-07T20:33:22.6171633Z scale_ub: Optional[float], 2025-05-07T20:33:22.6171888Z contiguous: bool, 2025-05-07T20:33:22.6172126Z compiled: bool, 2025-05-07T20:33:22.6172379Z ) -> None: 2025-05-07T20:33:22.6172582Z torch.manual_seed(2025) 2025-05-07T20:33:22.6172813Z 2025-05-07T20:33:22.6173081Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6173406Z 2025-05-07T20:33:22.6173589Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6173876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6174177Z x = x_sign * x_clamp 2025-05-07T20:33:22.6174398Z x0 = x[:, :D] 2025-05-07T20:33:22.6174605Z x1 = x[:, D:] 2025-05-07T20:33:22.6174800Z 2025-05-07T20:33:22.6174971Z if contiguous: 2025-05-07T20:33:22.6175197Z x0 = x0.contiguous() 2025-05-07T20:33:22.6175445Z x1 = x1.contiguous() 2025-05-07T20:33:22.6175666Z 2025-05-07T20:33:22.6175844Z if scale_ub is not None: 2025-05-07T20:33:22.6176103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6176487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6176783Z ) 2025-05-07T20:33:22.6176965Z else: 2025-05-07T20:33:22.6177156Z scale_ub_tensor = None 2025-05-07T20:33:22.6177396Z 2025-05-07T20:33:22.6177617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6177917Z op = silu_mul_quant 2025-05-07T20:33:22.6178159Z if compiled: 2025-05-07T20:33:22.6178399Z op = torch.compile(op) 2025-05-07T20:33:22.6178687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6178944Z 2025-05-07T20:33:22.6179125Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.6179284Z 2025-05-07T20:33:22.6179384Z moe/activation_test.py:117: 2025-05-07T20:33:22.6179670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6180042Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.6180316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6180990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.6181670Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.6182202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6182872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6183519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6184042Z kernel = self.compile( 2025-05-07T20:33:22.6184625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6185275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6185662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6185895Z 2025-05-07T20:33:22.6186097Z self = 2025-05-07T20:33:22.6187213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6188570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c485520c0>} 2025-05-07T20:33:22.6189935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6190947Z context = 2025-05-07T20:33:22.6191236Z 2025-05-07T20:33:22.6191400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6191952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6192405Z module_map=module_map) 2025-05-07T20:33:22.6192761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6193106Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.6193348Z E ^ 2025-05-07T20:33:22.6193804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6194256Z 2025-05-07T20:33:22.6194668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.6195174Z 2025-05-07T20:33:22.6195281Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.6195681Z self=, 2025-05-07T20:33:22.6196117Z T=1, 2025-05-07T20:33:22.6196295Z D=5120, 2025-05-07T20:33:22.6196475Z scale_ub=None, 2025-05-07T20:33:22.6196680Z contiguous=True, 2025-05-07T20:33:22.6196891Z compiled=True, 2025-05-07T20:33:22.6197074Z ) 2025-05-07T20:33:22.6197388Z self = 2025-05-07T20:33:22.6197860Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.6198110Z 2025-05-07T20:33:22.6198180Z @given( 2025-05-07T20:33:22.6198404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.6198704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.6198998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.6199316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.6199639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.6199918Z ) 2025-05-07T20:33:22.6200249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.6200687Z def test_silu_mul_quant( 2025-05-07T20:33:22.6200920Z self, 2025-05-07T20:33:22.6201098Z T: int, 2025-05-07T20:33:22.6201292Z D: int, 2025-05-07T20:33:22.6201505Z scale_ub: Optional[float], 2025-05-07T20:33:22.6201761Z contiguous: bool, 2025-05-07T20:33:22.6201994Z compiled: bool, 2025-05-07T20:33:22.6202202Z ) -> None: 2025-05-07T20:33:22.6202403Z torch.manual_seed(2025) 2025-05-07T20:33:22.6202631Z 2025-05-07T20:33:22.6202894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.6203218Z 2025-05-07T20:33:22.6203400Z x_sign = torch.sign(x) 2025-05-07T20:33:22.6203731Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.6204025Z x = x_sign * x_clamp 2025-05-07T20:33:22.6204341Z x0 = x[:, :D] 2025-05-07T20:33:22.6204551Z x1 = x[:, D:] 2025-05-07T20:33:22.6204753Z 2025-05-07T20:33:22.6204925Z if contiguous: 2025-05-07T20:33:22.6205152Z x0 = x0.contiguous() 2025-05-07T20:33:22.6205397Z x1 = x1.contiguous() 2025-05-07T20:33:22.6205619Z 2025-05-07T20:33:22.6205798Z if scale_ub is not None: 2025-05-07T20:33:22.6206060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.6206379Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.6206673Z ) 2025-05-07T20:33:22.6206851Z else: 2025-05-07T20:33:22.6207056Z scale_ub_tensor = None 2025-05-07T20:33:22.6207292Z 2025-05-07T20:33:22.6207511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6207810Z op = silu_mul_quant 2025-05-07T20:33:22.6208049Z if compiled: 2025-05-07T20:33:22.6208545Z op = torch.compile(op) 2025-05-07T20:33:22.6208829Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.6209096Z 2025-05-07T20:33:22.6209276Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.6209615Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.6209886Z 2025-05-07T20:33:22.6210111Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.6210436Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.6210715Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.6211018Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.6211371Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.6211661Z 2025-05-07T20:33:22.6211851Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:22.6212040Z 2025-05-07T20:33:22.6212140Z moe/activation_test.py:126: 2025-05-07T20:33:22.6212423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6212749Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.6213065Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.6213903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.6214638Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.6215172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.6215843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.6216518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.6217220Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.6217941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.6218562Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.6219146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.6219649Z fn() 2025-05-07T20:33:22.6220142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.6220713Z self.fn.run( 2025-05-07T20:33:22.6221162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.6221677Z kernel = self.compile( 2025-05-07T20:33:22.6222209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.6222932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.6223351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.6223603Z 2025-05-07T20:33:22.6223805Z self = 2025-05-07T20:33:22.6224883Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.6226240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48552d40>} 2025-05-07T20:33:22.6227559Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.6228579Z context = 2025-05-07T20:33:22.6228870Z 2025-05-07T20:33:22.6229075Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.6229590Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.6230086Z module_map=module_map) 2025-05-07T20:33:22.6230446Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.6230795Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.6231042Z E ^ 2025-05-07T20:33:22.6231497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.6231946Z 2025-05-07T20:33:22.6232354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Two near-identical Hypothesis examples elided: (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) and (T=128, D=5120, scale_ub=None, contiguous=True, compiled=True). Each re-printed the same test_silu_mul_quant source shown above, failed at "> y_fp8_ref, y_scale_ref = ref_fn()" (moe/activation_test.py:126) inside triton_quantize_fp8_row -> _kernel_quantize_fp8_row, and ended with the same CompilationError.]
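[Note: every ref_fn() failure above funnels into rowwise fp8 quantization. The sketch below is a plausible pure-PyTorch rendering of what triton_quantize_fp8_row computes, inferred only from how the test consumes its outputs (y is reconstructed as y_fp8.to(torch.float32) * y_scale[:, None]); the scale_ub handling is an assumption and this is not FBGEMM's actual kernel. A pure-torch path like this would not hit the Triton fp8e4nv restriction, since the dtype conversion is elementwise eager code:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max determines the dequantization scale so each row
        # fits the fp8e4m3 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Assumed semantics of the scale upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale
]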
[Likewise elided: (T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) and (T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True), with identical test source, failure point, and CompilationError.]
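[Note: all of these failures share one root cause. Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, and Triton's NVIDIA backend only emits it for GPUs of compute capability 8.9 or newer (Ada/Hopper); the A10G on this linux.g5.4xlarge runner is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are offered and the kernel is rejected at compile time. A minimal sketch of a capability guard that would skip such cases on pre-SM-8.9 runners; the helper and test below are illustrative, not FBGEMM's actual code:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton emits fp8e4nv (float8_e4m3fn) only for SM >= 8.9
        # (e.g. L4, L40S, H100); the A10G here reports (8, 6).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_fp8_roundtrip(self) -> None:
            x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
            # Basic dtype round-trip through float8_e4m3fn.
            y = x.to(torch.float8_e4m3fn).to(torch.float32)
            self.assertEqual(y.shape, x.shape)

    if __name__ == "__main__":
        unittest.main()
]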
2025-05-07T20:33:24.1297986Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:24.1300135Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:24.1301528Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:24.1302568Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:24.1303735Z W0507 20:33:24.128000 89542 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:24.5865090Z 2025-05-07T20:33:24.5865526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.5866003Z self=, 2025-05-07T20:33:24.5866411Z T=1, 2025-05-07T20:33:24.5866588Z D=5120, 2025-05-07T20:33:24.5866781Z scale_ub=1200.0, 2025-05-07T20:33:24.5867040Z contiguous=True, 2025-05-07T20:33:24.5867262Z compiled=True, 2025-05-07T20:33:24.5867468Z ) 2025-05-07T20:33:24.5867780Z self = 2025-05-07T20:33:24.5868274Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True [identical test source elided] 2025-05-07T20:33:24.5880128Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.5880299Z 2025-05-07T20:33:24.5880397Z moe/activation_test.py:117: 2025-05-07T20:33:24.5880693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5881015Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.5881295Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.5881856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:24.5882410Z return fn(*args, **kwargs) 2025-05-07T20:33:24.5883069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.5890506Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.5891224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.5891931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.5892600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.5893152Z kernel = self.compile( 2025-05-07T20:33:24.5893714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.5894386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.5894794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5895040Z 2025-05-07T20:33:24.5895257Z self = 2025-05-07T20:33:24.5896350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.5897775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9352f740>} 2025-05-07T20:33:24.5899138Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.5900173Z context = 2025-05-07T20:33:24.5900524Z 2025-05-07T20:33:24.5900700Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.5901239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.5901715Z module_map=module_map) 2025-05-07T20:33:24.5902106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.5902476Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.5902743Z E ^ 2025-05-07T20:33:24.5903224Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.5903685Z 2025-05-07T20:33:24.5904105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.5904618Z 2025-05-07T20:33:24.5904737Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.5905157Z self=, 2025-05-07T20:33:24.5905571Z T=1, 2025-05-07T20:33:24.5905771Z D=5120, 2025-05-07T20:33:24.5906023Z scale_ub=None, 2025-05-07T20:33:24.5906261Z contiguous=False, 2025-05-07T20:33:24.5906504Z compiled=True, 2025-05-07T20:33:24.5906728Z ) 2025-05-07T20:33:24.5907060Z self = 2025-05-07T20:33:24.5907616Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.5907880Z 2025-05-07T20:33:24.5907978Z @given( 2025-05-07T20:33:24.5908507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.5908848Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.5909173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.5909512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.5909861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.5910163Z ) 2025-05-07T20:33:24.5910517Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.5910977Z def test_silu_mul_quant( 2025-05-07T20:33:24.5911244Z self, 2025-05-07T20:33:24.5911458Z T: int, 2025-05-07T20:33:24.5911664Z D: int, 2025-05-07T20:33:24.5912000Z scale_ub: Optional[float], 2025-05-07T20:33:24.5912292Z contiguous: bool, 2025-05-07T20:33:24.5912543Z compiled: bool, 2025-05-07T20:33:24.5912786Z ) -> None: 2025-05-07T20:33:24.5913025Z torch.manual_seed(2025) 2025-05-07T20:33:24.5913278Z 2025-05-07T20:33:24.5913576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.5913939Z 2025-05-07T20:33:24.5914144Z x_sign = torch.sign(x) 2025-05-07T20:33:24.5914460Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.5914789Z x = x_sign * x_clamp 2025-05-07T20:33:24.5915038Z x0 = x[:, :D] 2025-05-07T20:33:24.5915276Z x1 = x[:, D:] 2025-05-07T20:33:24.5915508Z 2025-05-07T20:33:24.5915703Z if contiguous: 2025-05-07T20:33:24.5915958Z x0 = x0.contiguous() 2025-05-07T20:33:24.5916238Z x1 = x1.contiguous() 2025-05-07T20:33:24.5916499Z 2025-05-07T20:33:24.5916703Z if scale_ub is not None: 2025-05-07T20:33:24.5917006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.5917362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.5917677Z ) 2025-05-07T20:33:24.5917889Z else: 2025-05-07T20:33:24.5918107Z scale_ub_tensor = None 2025-05-07T20:33:24.5918377Z 2025-05-07T20:33:24.5918626Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.5918944Z op = silu_mul_quant 2025-05-07T20:33:24.5919213Z if compiled: 2025-05-07T20:33:24.5919478Z op = torch.compile(op) 2025-05-07T20:33:24.5919781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.5920071Z 2025-05-07T20:33:24.5920358Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.5920663Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.5920956Z 2025-05-07T20:33:24.5921207Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.5921558Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.5921859Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.5922184Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.5922552Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.5922866Z 2025-05-07T20:33:24.5923086Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:24.5923286Z 2025-05-07T20:33:24.5923401Z moe/activation_test.py:126: 2025-05-07T20:33:24.5923706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5924057Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.5924525Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.5925378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.5926128Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.5926682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.5927428Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.5928168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.5928880Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.5929613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.5930252Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.5930853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.5931370Z fn() 2025-05-07T20:33:24.5931884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.5932520Z self.fn.run( 2025-05-07T20:33:24.5932984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.5933513Z kernel = self.compile( 2025-05-07T20:33:24.5934060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.5934715Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.5935126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.5935366Z 2025-05-07T20:33:24.5935579Z self = 2025-05-07T20:33:24.5936670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.5938100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93132de0>} 2025-05-07T20:33:24.5939431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.5940453Z context = 2025-05-07T20:33:24.5940750Z 2025-05-07T20:33:24.5940920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.5941530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.5942000Z module_map=module_map) 2025-05-07T20:33:24.5942378Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.5942745Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.5943013Z E ^ 2025-05-07T20:33:24.5943489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.5943946Z 2025-05-07T20:33:24.5944359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7343917Z 2025-05-07T20:33:24.7344292Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7344744Z self=, 2025-05-07T20:33:24.7345150Z T=1, 2025-05-07T20:33:24.7345338Z D=5120, 2025-05-07T20:33:24.7345534Z scale_ub=None, 2025-05-07T20:33:24.7345753Z contiguous=True, 2025-05-07T20:33:24.7345981Z compiled=False, 2025-05-07T20:33:24.7346463Z ) 2025-05-07T20:33:24.7346777Z self = 2025-05-07T20:33:24.7347261Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:24.7347607Z 2025-05-07T20:33:24.7347692Z @given( 2025-05-07T20:33:24.7347915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.7348231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.7348538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.7348869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.7349185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.7349465Z ) 2025-05-07T20:33:24.7349807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.7350237Z def test_silu_mul_quant( 2025-05-07T20:33:24.7350483Z self, 2025-05-07T20:33:24.7350672Z T: int, 2025-05-07T20:33:24.7350862Z D: int, 2025-05-07T20:33:24.7351078Z scale_ub: Optional[float], 2025-05-07T20:33:24.7351341Z contiguous: bool, 2025-05-07T20:33:24.7351571Z compiled: bool, 2025-05-07T20:33:24.7351885Z ) -> None: 2025-05-07T20:33:24.7352099Z torch.manual_seed(2025) 2025-05-07T20:33:24.7352331Z 2025-05-07T20:33:24.7352599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.7352937Z 2025-05-07T20:33:24.7353123Z x_sign = torch.sign(x) 2025-05-07T20:33:24.7353414Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.7353718Z x = x_sign * x_clamp 2025-05-07T20:33:24.7353956Z x0 = x[:, :D] 2025-05-07T20:33:24.7354186Z x1 = x[:, D:] 2025-05-07T20:33:24.7354412Z 2025-05-07T20:33:24.7354602Z if contiguous: 2025-05-07T20:33:24.7354822Z x0 = x0.contiguous() 2025-05-07T20:33:24.7355077Z x1 = x1.contiguous() 2025-05-07T20:33:24.7355314Z 2025-05-07T20:33:24.7355499Z if scale_ub is not None: 2025-05-07T20:33:24.7355769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.7356101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.7356402Z ) 2025-05-07T20:33:24.7356601Z else: 2025-05-07T20:33:24.7356813Z scale_ub_tensor = None 2025-05-07T20:33:24.7357055Z 2025-05-07T20:33:24.7357285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.7357595Z op = silu_mul_quant 2025-05-07T20:33:24.7357840Z if compiled: 2025-05-07T20:33:24.7358138Z op = torch.compile(op) 2025-05-07T20:33:24.7358471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7358795Z 2025-05-07T20:33:24.7359011Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.7359200Z 2025-05-07T20:33:24.7359310Z moe/activation_test.py:117: 2025-05-07T20:33:24.7359735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7360132Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.7360449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7361269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.7362102Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.7362733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.7363542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.7364484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.7365112Z kernel = self.compile( 2025-05-07T20:33:24.7365753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.7366576Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.7366971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7367197Z 2025-05-07T20:33:24.7367410Z self = 2025-05-07T20:33:24.7368617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.7370008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93521940>} 2025-05-07T20:33:24.7371354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.7372386Z context = 2025-05-07T20:33:24.7372673Z 2025-05-07T20:33:24.7372847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.7373407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.7373881Z module_map=module_map) 2025-05-07T20:33:24.7374248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.7374592Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.7374852Z E ^ 2025-05-07T20:33:24.7375317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.7375768Z 2025-05-07T20:33:24.7376190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7376707Z 2025-05-07T20:33:24.7376808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7377227Z self=, 2025-05-07T20:33:24.7377635Z T=128, 2025-05-07T20:33:24.7377820Z D=5120, 2025-05-07T20:33:24.7378041Z scale_ub=None, 2025-05-07T20:33:24.7378261Z contiguous=False, 2025-05-07T20:33:24.7378478Z compiled=True, 2025-05-07T20:33:24.7378682Z ) 2025-05-07T20:33:24.7378996Z self = 2025-05-07T20:33:24.7379481Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.7379745Z 2025-05-07T20:33:24.7379819Z @given( 2025-05-07T20:33:24.7380046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.7380355Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.7380651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.7381027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.7381354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.7381630Z ) 2025-05-07T20:33:24.7381979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.7382410Z def test_silu_mul_quant( 2025-05-07T20:33:24.7382647Z self, 2025-05-07T20:33:24.7382845Z T: int, 2025-05-07T20:33:24.7383041Z D: int, 2025-05-07T20:33:24.7383254Z scale_ub: Optional[float], 2025-05-07T20:33:24.7383514Z contiguous: bool, 2025-05-07T20:33:24.7383747Z compiled: bool, 2025-05-07T20:33:24.7383965Z ) -> None: 2025-05-07T20:33:24.7384167Z torch.manual_seed(2025) 2025-05-07T20:33:24.7384404Z 2025-05-07T20:33:24.7384676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.7385005Z 2025-05-07T20:33:24.7385191Z x_sign = torch.sign(x) 2025-05-07T20:33:24.7385484Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.7385780Z x = x_sign * x_clamp 2025-05-07T20:33:24.7386062Z x0 = x[:, :D] 2025-05-07T20:33:24.7386273Z x1 = x[:, D:] 2025-05-07T20:33:24.7386470Z 2025-05-07T20:33:24.7386656Z if contiguous: 2025-05-07T20:33:24.7386889Z x0 = x0.contiguous() 2025-05-07T20:33:24.7387184Z x1 = x1.contiguous() 2025-05-07T20:33:24.7387422Z 2025-05-07T20:33:24.7387621Z if scale_ub is not None: 2025-05-07T20:33:24.7387885Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.7388215Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.7388529Z ) 2025-05-07T20:33:24.7388722Z else: 2025-05-07T20:33:24.7388924Z scale_ub_tensor = None 2025-05-07T20:33:24.7389173Z 2025-05-07T20:33:24.7389399Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.7389701Z op = silu_mul_quant 2025-05-07T20:33:24.7389953Z if compiled: 2025-05-07T20:33:24.7390202Z op = torch.compile(op) 2025-05-07T20:33:24.7390489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7390763Z 2025-05-07T20:33:24.7390954Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.7391121Z 2025-05-07T20:33:24.7391266Z moe/activation_test.py:117: 2025-05-07T20:33:24.7391565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7391897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.7392176Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.7392726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:24.7393280Z return fn(*args, **kwargs) 
2025-05-07T20:33:24.7393936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.7394610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.7395146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.7395827Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.7396489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.7397017Z kernel = self.compile( 2025-05-07T20:33:24.7397562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.7398219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.7398610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.7398843Z 2025-05-07T20:33:24.7399050Z self = 2025-05-07T20:33:24.7400176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.7401543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b93133880>} 2025-05-07T20:33:24.7402884Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.7403891Z context = 2025-05-07T20:33:24.7404185Z 2025-05-07T20:33:24.7404443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.7404967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.7405437Z module_map=module_map) 2025-05-07T20:33:24.7405794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.7406191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.7406452Z E ^ 2025-05-07T20:33:24.7406912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.7407411Z 2025-05-07T20:33:24.7407823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.7408619Z 2025-05-07T20:33:24.7408721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.7409142Z self=, 2025-05-07T20:33:24.7409539Z T=128, 2025-05-07T20:33:24.7409734Z D=7168, 2025-05-07T20:33:24.7409935Z scale_ub=1200.0, 2025-05-07T20:33:24.7410156Z contiguous=False, 2025-05-07T20:33:24.7410387Z compiled=False, 2025-05-07T20:33:24.8978007Z ) 2025-05-07T20:33:24.8978449Z self = 2025-05-07T20:33:24.8978981Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:24.8979265Z 2025-05-07T20:33:24.8979379Z @given( 2025-05-07T20:33:24.8979978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.8980308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.8980617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.8980942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.8981275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.8981564Z ) 2025-05-07T20:33:24.8981908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.8982351Z def test_silu_mul_quant( 2025-05-07T20:33:24.8982594Z self, 2025-05-07T20:33:24.8982781Z T: int, 2025-05-07T20:33:24.8982977Z D: int, 2025-05-07T20:33:24.8983202Z scale_ub: Optional[float], 2025-05-07T20:33:24.8983469Z contiguous: bool, 2025-05-07T20:33:24.8983717Z compiled: bool, 2025-05-07T20:33:24.8983966Z ) -> None: 2025-05-07T20:33:24.8984179Z torch.manual_seed(2025) 2025-05-07T20:33:24.8984425Z 2025-05-07T20:33:24.8984700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.8985050Z 2025-05-07T20:33:24.8985244Z x_sign = torch.sign(x) 2025-05-07T20:33:24.8985548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.8985866Z x = x_sign * x_clamp 2025-05-07T20:33:24.8986108Z x0 = x[:, :D] 2025-05-07T20:33:24.8986325Z x1 = x[:, D:] 2025-05-07T20:33:24.8986533Z 2025-05-07T20:33:24.8986710Z if contiguous: 2025-05-07T20:33:24.8986942Z x0 = x0.contiguous() 2025-05-07T20:33:24.8987205Z x1 = x1.contiguous() 2025-05-07T20:33:24.8987440Z 2025-05-07T20:33:24.8987636Z if scale_ub is not None: 2025-05-07T20:33:24.8988000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.8988340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.8988650Z ) 2025-05-07T20:33:24.8988850Z else: 2025-05-07T20:33:24.8989053Z scale_ub_tensor = None 2025-05-07T20:33:24.8989324Z 2025-05-07T20:33:24.8989598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.8989912Z op = silu_mul_quant 2025-05-07T20:33:24.8990171Z if compiled: 2025-05-07T20:33:24.8990425Z op = torch.compile(op) 2025-05-07T20:33:24.8990724Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.8990994Z 2025-05-07T20:33:24.8991195Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.8991362Z 2025-05-07T20:33:24.8991465Z moe/activation_test.py:117: 2025-05-07T20:33:24.8991750Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.8992078Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.8992354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.8993111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.8993796Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.8994400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.8995069Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.8995716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.8996237Z kernel = self.compile( 2025-05-07T20:33:24.8996767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.8997411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.8997798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.8998030Z 2025-05-07T20:33:24.8998233Z self = 2025-05-07T20:33:24.8999300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.9000724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b931f87c0>} 2025-05-07T20:33:24.9002047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.9003054Z context = 2025-05-07T20:33:24.9003344Z 2025-05-07T20:33:24.9003506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.9004014Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.9004613Z module_map=module_map) 2025-05-07T20:33:24.9004972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.9005314Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.9005554Z E ^ 2025-05-07T20:33:24.9006008Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:24.9006861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:24.9007467Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) [test body identical to the example above] raises the identical CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:24.9045819Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) raises the identical CompilationError
2025-05-07T20:33:25.0636484Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) raises the identical CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching activation.py:80
2025-05-07T20:33:25.0669304Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError
2025-05-07T20:33:25.2775662Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:25.2778375Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test body identical to the example above; this time fn() itself succeeds and the failure moves to the reference path:]
2025-05-07T20:33:25.2802202Z         y_fp8, y_scale = fn()
2025-05-07T20:33:25.2802543Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:25.2803095Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:25.2803437Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:25.2803890Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:25.2804218Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:25.2804667Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:25.2805282Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:25.2805589Z moe/activation_test.py:126:
2025-05-07T20:33:25.2805899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:25.2806237Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:25.2806560Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:25.2807354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:25.2808159Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:25.2809110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:33:25.2809791Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:25.2810765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:25.2811647Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:25.2812541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:25.2813308Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:25.2814036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:25.2814661Z     fn()
2025-05-07T20:33:25.2815265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:25.2815976Z     self.fn.run(
2025-05-07T20:33:25.2816539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:25.2817184Z     kernel = self.compile(
2025-05-07T20:33:25.2817830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:25.2818673Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:25.2819073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:25.2828963Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:25.2829461Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:33:25.2830258Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:25.2830593Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:25.2830922Z E   ^
2025-05-07T20:33:25.2831365Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.2832211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
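Every failure above has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which Triton only emits for NVIDIA GPUs of compute capability 8.9 or newer (Ada, Hopper). This job runs on a linux.g5.4xlarge runner whose A10G GPU is sm_86, where Triton offers only fp8e4b15 and fp8e5, so both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fail at compile time before any numerics run. Below is a minimal sketch of a capability gate that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv and its application are illustrative assumptions, not part of the test file.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton exposes fp8e4nv (float8_e4m3fn) only on
    # NVIDIA GPUs with compute capability >= 8.9; the A10G here is sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, for example:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
# def test_silu_mul_quant(...) -> None: ...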
2025-05-07T20:33:25.2832870Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:25.4295537Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) raises the identical CompilationError
2025-05-07T20:33:25.4326350Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) raises the identical CompilationError
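Because every sampled example dies during kernel compilation, Hypothesis's example search and shrinking add no information here; the error reproduces with a single direct call. A sketch follows, assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module shown in the traceback.

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Replay one sampled case from the log without the Hypothesis machinery.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
# On sm_86 this raises triton.compiler.errors.CompilationError wrapping
# ValueError("type fp8e4nv not supported in this architecture. ...").
y_fp8, y_scale = silu_mul_quant(x0, x1, None)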
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.4356316Z 2025-05-07T20:33:25.4356730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.4357231Z 2025-05-07T20:33:25.4357328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.4357734Z self=, 2025-05-07T20:33:25.4358150Z T=2048, 2025-05-07T20:33:25.4358385Z D=7168, 2025-05-07T20:33:25.4358569Z scale_ub=1200.0, 2025-05-07T20:33:25.4358783Z contiguous=False, 2025-05-07T20:33:25.4358993Z compiled=True, 2025-05-07T20:33:25.6257919Z ) 2025-05-07T20:33:25.6258579Z self = 2025-05-07T20:33:25.6259422Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:25.6259695Z 2025-05-07T20:33:25.6259774Z @given( 2025-05-07T20:33:25.6259993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.6260300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.6260592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.6260912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.6261223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.6261492Z ) 2025-05-07T20:33:25.6261830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.6262258Z def test_silu_mul_quant( 2025-05-07T20:33:25.6262484Z self, 2025-05-07T20:33:25.6262666Z T: int, 2025-05-07T20:33:25.6262845Z D: int, 2025-05-07T20:33:25.6263147Z scale_ub: Optional[float], 2025-05-07T20:33:25.6263408Z contiguous: bool, 2025-05-07T20:33:25.6263629Z compiled: bool, 2025-05-07T20:33:25.6263848Z ) -> None: 2025-05-07T20:33:25.6264051Z torch.manual_seed(2025) 2025-05-07T20:33:25.6264275Z 2025-05-07T20:33:25.6264538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.6264867Z 2025-05-07T20:33:25.6265042Z x_sign = torch.sign(x) 2025-05-07T20:33:25.6265323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.6265621Z x = x_sign * x_clamp 2025-05-07T20:33:25.6265848Z x0 = x[:, :D] 2025-05-07T20:33:25.6266048Z x1 = x[:, D:] 2025-05-07T20:33:25.6266239Z 2025-05-07T20:33:25.6266413Z if contiguous: 2025-05-07T20:33:25.6266628Z x0 = x0.contiguous() 2025-05-07T20:33:25.6266874Z x1 = x1.contiguous() 2025-05-07T20:33:25.6267100Z 2025-05-07T20:33:25.6267316Z if scale_ub is not None: 2025-05-07T20:33:25.6267580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.6267901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.6268194Z ) 2025-05-07T20:33:25.6268377Z else: 2025-05-07T20:33:25.6268574Z scale_ub_tensor = None 2025-05-07T20:33:25.6268813Z 2025-05-07T20:33:25.6269038Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.6269333Z op = silu_mul_quant 2025-05-07T20:33:25.6269576Z if compiled: 2025-05-07T20:33:25.6269817Z op = torch.compile(op) 2025-05-07T20:33:25.6270097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.6270440Z 2025-05-07T20:33:25.6270630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.6270792Z 2025-05-07T20:33:25.6270888Z moe/activation_test.py:117: 2025-05-07T20:33:25.6271182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6271511Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.6271788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.6272330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.6272878Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.6273524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.6274187Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.6274709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.6275383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.6276116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.6276628Z kernel = self.compile( 2025-05-07T20:33:25.6277161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.6277876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.6278263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6278487Z 2025-05-07T20:33:25.6278686Z self = 2025-05-07T20:33:25.6279756Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.6281130Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d40720>} 2025-05-07T20:33:25.6290821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.6292046Z context = 2025-05-07T20:33:25.6292347Z 2025-05-07T20:33:25.6292517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.6293053Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.6293522Z module_map=module_map) 2025-05-07T20:33:25.6293898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.6294266Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.6294537Z E ^ 2025-05-07T20:33:25.6295014Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.6295471Z 
2025-05-07T20:33:25.6295890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:25.6296407Z 
2025-05-07T20:33:25.6296527Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:25.6296945Z     self=<...>,
2025-05-07T20:33:25.6297358Z     T=1,
2025-05-07T20:33:25.6297558Z     D=5120,
2025-05-07T20:33:25.6297767Z     scale_ub=None,
2025-05-07T20:33:25.6297989Z     contiguous=False,
2025-05-07T20:33:25.6298231Z     compiled=False,
2025-05-07T20:33:25.6298446Z )
2025-05-07T20:33:25.6298765Z self = <...>
2025-05-07T20:33:25.6299316Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:25.6299579Z 
2025-05-07T20:33:25.6299669Z     @given(
2025-05-07T20:33:25.6299903Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:25.6300226Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:25.6300542Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:25.6300878Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:25.6301219Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:25.6301510Z     )
2025-05-07T20:33:25.6301870Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:25.6302316Z     def test_silu_mul_quant(
2025-05-07T20:33:25.6302565Z         self,
2025-05-07T20:33:25.6302769Z         T: int,
2025-05-07T20:33:25.6302967Z         D: int,
2025-05-07T20:33:25.6303190Z         scale_ub: Optional[float],
2025-05-07T20:33:25.6303465Z         contiguous: bool,
2025-05-07T20:33:25.6303698Z         compiled: bool,
2025-05-07T20:33:25.6303932Z     ) -> None:
2025-05-07T20:33:25.6304155Z         torch.manual_seed(2025)
2025-05-07T20:33:25.6304439Z 
2025-05-07T20:33:25.6304716Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:25.6305065Z 
2025-05-07T20:33:25.6305254Z         x_sign = torch.sign(x)
2025-05-07T20:33:25.6305597Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:25.6305907Z         x = x_sign * x_clamp
2025-05-07T20:33:25.6306142Z         x0 = x[:, :D]
2025-05-07T20:33:25.6306362Z         x1 = x[:, D:]
2025-05-07T20:33:25.6306578Z 
2025-05-07T20:33:25.6306773Z         if contiguous:
2025-05-07T20:33:25.6307001Z             x0 = x0.contiguous()
2025-05-07T20:33:25.6307266Z             x1 = x1.contiguous()
2025-05-07T20:33:25.6307511Z 
2025-05-07T20:33:25.6307696Z         if scale_ub is not None:
2025-05-07T20:33:25.6307985Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:25.6308742Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:25.6309047Z             )
2025-05-07T20:33:25.6309250Z         else:
2025-05-07T20:33:25.6309465Z             scale_ub_tensor = None
2025-05-07T20:33:25.6309714Z 
2025-05-07T20:33:25.6309951Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:25.6310355Z             op = silu_mul_quant
2025-05-07T20:33:25.6310604Z             if compiled:
2025-05-07T20:33:25.6310854Z                 op = torch.compile(op)
2025-05-07T20:33:25.6311145Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:25.6311425Z 
2025-05-07T20:33:25.6311617Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:25.6311781Z 
2025-05-07T20:33:25.6311878Z moe/activation_test.py:117: 
2025-05-07T20:33:25.6312172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:25.6312506Z moe/activation_test.py:115: in fn
2025-05-07T20:33:25.6312787Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:25.6313477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:25.6314166Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:25.6314711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:25.6315390Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:25.6316054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:25.6316587Z     kernel = self.compile(
2025-05-07T20:33:25.6317124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:25.6317779Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:25.6318181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:25.6318408Z 
2025-05-07T20:33:25.6318693Z self = <...>
2025-05-07T20:33:25.6319763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:25.6321137Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f8c48d41120>}
2025-05-07T20:33:25.6322472Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:25.6323488Z context = <...>
2025-05-07T20:33:25.6323775Z 
2025-05-07T20:33:25.6323948Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:25.6324602Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:25.6325075Z                            module_map=module_map)
2025-05-07T20:33:25.6325439Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:25.6325786Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:25.6326103Z E       ^
2025-05-07T20:33:25.6326562Z E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.6327009Z 
2025-05-07T20:33:25.6327427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
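The block above fails while compiling the `_fbgemm_silu_mul_quant` Triton kernel in plain eager mode (compiled=False), so the failure does not depend on torch.compile. A minimal standalone repro sketch, assuming only the import path and call signature visible in the traceback (shapes taken from the example above; this is not an official fbgemm_gpu example):

```python
# Minimal repro sketch, assuming the silu_mul_quant import path and
# (x0, x1, scale_ub) signature shown in the traceback above. On a GPU
# without fp8e4nv support this should raise the same CompilationError.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)

# Launches _fbgemm_silu_mul_quant[grid](...), which produces an fp8e4nv
# output; the third argument (scale_ub) may be None, as in the runs above.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)
```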
2025-05-07T20:33:25.6344889Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.6345419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.6346086Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.6346750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.6347274Z kernel = self.compile( 2025-05-07T20:33:25.6347814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.6348463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.6348863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.6349134Z 2025-05-07T20:33:25.6349343Z self = 2025-05-07T20:33:25.6350409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.6351760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d42480>} 2025-05-07T20:33:25.6353093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.6354107Z context = 2025-05-07T20:33:25.6354393Z 2025-05-07T20:33:25.6354563Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.6355078Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.6355545Z module_map=module_map) 2025-05-07T20:33:25.6355909Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.6356256Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.6356511Z E ^ 2025-05-07T20:33:25.6356977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.6357420Z 2025-05-07T20:33:25.6357882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.7927066Z 2025-05-07T20:33:25.7927512Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.7928235Z self=, 2025-05-07T20:33:25.7928654Z T=16384, 2025-05-07T20:33:25.7928866Z D=7168, 2025-05-07T20:33:25.7929070Z scale_ub=None, 2025-05-07T20:33:25.7929291Z contiguous=True, 2025-05-07T20:33:25.7929516Z compiled=True, 2025-05-07T20:33:25.7929715Z ) 2025-05-07T20:33:25.7930037Z self = 2025-05-07T20:33:25.7930535Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.7930807Z 2025-05-07T20:33:25.7930890Z @given( 2025-05-07T20:33:25.7931117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.7931437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.7931759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.7932105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.7932729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.7933025Z ) 2025-05-07T20:33:25.7933372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.7933909Z def test_silu_mul_quant( 2025-05-07T20:33:25.7934159Z self, 2025-05-07T20:33:25.7934348Z T: int, 2025-05-07T20:33:25.7934551Z D: int, 2025-05-07T20:33:25.7934772Z scale_ub: Optional[float], 2025-05-07T20:33:25.7935039Z contiguous: bool, 2025-05-07T20:33:25.7935281Z compiled: bool, 2025-05-07T20:33:25.7935515Z ) -> None: 2025-05-07T20:33:25.7935725Z torch.manual_seed(2025) 2025-05-07T20:33:25.7935972Z 2025-05-07T20:33:25.7936245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.7936578Z 2025-05-07T20:33:25.7936765Z x_sign = torch.sign(x) 2025-05-07T20:33:25.7937061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.7937368Z x = x_sign * x_clamp 2025-05-07T20:33:25.7937595Z x0 = x[:, :D] 2025-05-07T20:33:25.7937835Z x1 = x[:, D:] 2025-05-07T20:33:25.7938043Z 2025-05-07T20:33:25.7938332Z if contiguous: 2025-05-07T20:33:25.7938554Z x0 = x0.contiguous() 2025-05-07T20:33:25.7938814Z x1 = x1.contiguous() 2025-05-07T20:33:25.7939055Z 2025-05-07T20:33:25.7939236Z if scale_ub is not None: 2025-05-07T20:33:25.7939508Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.7939849Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.7940191Z ) 2025-05-07T20:33:25.7940395Z else: 2025-05-07T20:33:25.7940606Z scale_ub_tensor = None 2025-05-07T20:33:25.7940858Z 2025-05-07T20:33:25.7941086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.7941398Z op = silu_mul_quant 2025-05-07T20:33:25.7941653Z if compiled: 2025-05-07T20:33:25.7941895Z op = torch.compile(op) 2025-05-07T20:33:25.7942194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7942469Z 2025-05-07T20:33:25.7942653Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.7942828Z 2025-05-07T20:33:25.7942931Z moe/activation_test.py:117: 2025-05-07T20:33:25.7943228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7943561Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.7943841Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7944399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.7944960Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.7945610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.7946394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.7946932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.7947611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.7948273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.7948806Z kernel = self.compile( 2025-05-07T20:33:25.7949349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.7949998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.7950396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7950628Z 2025-05-07T20:33:25.7950834Z self = 2025-05-07T20:33:25.7951963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.7953345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d43740>} 2025-05-07T20:33:25.7954720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.7955737Z context = 2025-05-07T20:33:25.7956025Z 2025-05-07T20:33:25.7956200Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.7956718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.7957183Z module_map=module_map) 2025-05-07T20:33:25.7957560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.7957913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.7958174Z E ^ 2025-05-07T20:33:25.7958688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.7959135Z 2025-05-07T20:33:25.7959558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.7960063Z 2025-05-07T20:33:25.7960174Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.7960581Z self=, 2025-05-07T20:33:25.7960992Z T=4096, 2025-05-07T20:33:25.7961201Z D=5120, 2025-05-07T20:33:25.7961398Z scale_ub=None, 2025-05-07T20:33:25.7961631Z contiguous=False, 2025-05-07T20:33:25.7961874Z compiled=True, 2025-05-07T20:33:25.7962084Z ) 2025-05-07T20:33:25.7962419Z self = 2025-05-07T20:33:25.7962929Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:25.7963199Z 2025-05-07T20:33:25.7963293Z @given( 2025-05-07T20:33:25.7963527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.7963847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.7964168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.7964660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.7964990Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.7965275Z ) 2025-05-07T20:33:25.7965615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.7966052Z def test_silu_mul_quant( 2025-05-07T20:33:25.7966299Z self, 2025-05-07T20:33:25.7966495Z T: int, 2025-05-07T20:33:25.7966753Z D: int, 2025-05-07T20:33:25.7966974Z scale_ub: Optional[float], 2025-05-07T20:33:25.7967237Z contiguous: bool, 2025-05-07T20:33:25.7967475Z compiled: bool, 2025-05-07T20:33:25.7967711Z ) -> None: 2025-05-07T20:33:25.7967925Z torch.manual_seed(2025) 2025-05-07T20:33:25.7968162Z 2025-05-07T20:33:25.7968442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.7968787Z 2025-05-07T20:33:25.7968973Z x_sign = torch.sign(x) 2025-05-07T20:33:25.7969264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.7969577Z x = x_sign * x_clamp 2025-05-07T20:33:25.7969810Z x0 = x[:, :D] 2025-05-07T20:33:25.7970024Z x1 = x[:, D:] 2025-05-07T20:33:25.7970240Z 2025-05-07T20:33:25.7970429Z if contiguous: 2025-05-07T20:33:25.7970665Z x0 = x0.contiguous() 2025-05-07T20:33:25.7970934Z x1 = x1.contiguous() 2025-05-07T20:33:25.7971173Z 2025-05-07T20:33:25.7971389Z if scale_ub is not None: 2025-05-07T20:33:25.7971723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.7972059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.7972377Z ) 2025-05-07T20:33:25.7972581Z else: 2025-05-07T20:33:25.7972842Z scale_ub_tensor = None 2025-05-07T20:33:25.7973092Z 2025-05-07T20:33:25.7973341Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.7973668Z op = silu_mul_quant 2025-05-07T20:33:25.7973928Z if compiled: 2025-05-07T20:33:25.7974194Z op = torch.compile(op) 2025-05-07T20:33:25.7974499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7974770Z 2025-05-07T20:33:25.7974977Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.7975147Z 2025-05-07T20:33:25.7975263Z moe/activation_test.py:117: 2025-05-07T20:33:25.7975564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7975909Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.7976201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.7976752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.7977347Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.7978004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.7978688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.7979212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.7979894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.7980558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.7981091Z kernel = self.compile( 2025-05-07T20:33:25.7981634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.7982290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.7982686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.7982915Z 2025-05-07T20:33:25.7983132Z self = 2025-05-07T20:33:25.7984196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.7985556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92998c20>} 2025-05-07T20:33:25.7986938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.7987955Z context = 2025-05-07T20:33:25.7988286Z 2025-05-07T20:33:25.7988461Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.7988991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.7989460Z module_map=module_map) 2025-05-07T20:33:25.7989843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.7990192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.7990465Z E ^ 2025-05-07T20:33:25.7990940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.7991389Z 2025-05-07T20:33:25.7991855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9401758Z 2025-05-07T20:33:25.9402091Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9402729Z self=, 2025-05-07T20:33:25.9403559Z T=4096, 2025-05-07T20:33:25.9403830Z D=5120, 2025-05-07T20:33:25.9404063Z scale_ub=1200.0, 2025-05-07T20:33:25.9404408Z contiguous=False, 2025-05-07T20:33:25.9404631Z compiled=False, 2025-05-07T20:33:25.9404843Z ) 2025-05-07T20:33:25.9405168Z self = 2025-05-07T20:33:25.9405662Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:25.9405949Z 2025-05-07T20:33:25.9406023Z @given( 2025-05-07T20:33:25.9406258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.9406560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.9406874Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.9407213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.9407545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.9407820Z ) 2025-05-07T20:33:25.9408439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.9408898Z def test_silu_mul_quant( 2025-05-07T20:33:25.9409138Z self, 2025-05-07T20:33:25.9409336Z T: int, 2025-05-07T20:33:25.9409531Z D: int, 2025-05-07T20:33:25.9409743Z scale_ub: Optional[float], 2025-05-07T20:33:25.9410023Z contiguous: bool, 2025-05-07T20:33:25.9410265Z compiled: bool, 2025-05-07T20:33:25.9410484Z ) -> None: 2025-05-07T20:33:25.9410702Z torch.manual_seed(2025) 2025-05-07T20:33:25.9410946Z 2025-05-07T20:33:25.9411215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.9411556Z 2025-05-07T20:33:25.9411752Z x_sign = torch.sign(x) 2025-05-07T20:33:25.9412038Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.9412350Z x = x_sign * x_clamp 2025-05-07T20:33:25.9412589Z x0 = x[:, :D] 2025-05-07T20:33:25.9412794Z x1 = x[:, D:] 2025-05-07T20:33:25.9413000Z 2025-05-07T20:33:25.9413192Z if contiguous: 2025-05-07T20:33:25.9413419Z x0 = x0.contiguous() 2025-05-07T20:33:25.9413669Z x1 = x1.contiguous() 2025-05-07T20:33:25.9413906Z 2025-05-07T20:33:25.9420009Z if scale_ub is not None: 2025-05-07T20:33:25.9420313Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.9420664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.9420974Z ) 2025-05-07T20:33:25.9421181Z else: 2025-05-07T20:33:25.9421400Z scale_ub_tensor = None 2025-05-07T20:33:25.9421658Z 2025-05-07T20:33:25.9422018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.9422346Z op = silu_mul_quant 2025-05-07T20:33:25.9422612Z if compiled: 2025-05-07T20:33:25.9422896Z op = torch.compile(op) 2025-05-07T20:33:25.9423193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9423481Z 2025-05-07T20:33:25.9423684Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.9423851Z 2025-05-07T20:33:25.9423965Z moe/activation_test.py:117: 2025-05-07T20:33:25.9424263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9424608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.9424900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9425595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:25.9426290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.9426839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.9427618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.9428284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.9428851Z kernel = self.compile( 2025-05-07T20:33:25.9429480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.9430143Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.9430556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9430797Z 2025-05-07T20:33:25.9431008Z self = 2025-05-07T20:33:25.9432100Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.9433489Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b929996c0>} 2025-05-07T20:33:25.9434898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.9435931Z context = 2025-05-07T20:33:25.9436222Z 2025-05-07T20:33:25.9436397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.9436929Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.9437400Z module_map=module_map) 2025-05-07T20:33:25.9437778Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.9438142Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.9438406Z E ^ 2025-05-07T20:33:25.9438907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.9439358Z 2025-05-07T20:33:25.9439787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9440299Z 2025-05-07T20:33:25.9440415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9440832Z self=, 2025-05-07T20:33:25.9441241Z T=4096, 2025-05-07T20:33:25.9441443Z D=5120, 2025-05-07T20:33:25.9441637Z scale_ub=1200.0, 2025-05-07T20:33:25.9441869Z contiguous=False, 2025-05-07T20:33:25.9442103Z compiled=True, 2025-05-07T20:33:25.9442310Z ) 2025-05-07T20:33:25.9442689Z self = 2025-05-07T20:33:25.9443196Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:25.9443469Z 2025-05-07T20:33:25.9443556Z @given( 2025-05-07T20:33:25.9443787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.9444111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.9444525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.9444854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.9445189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.9445483Z ) 2025-05-07T20:33:25.9445827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.9446267Z def test_silu_mul_quant( 2025-05-07T20:33:25.9446514Z self, 2025-05-07T20:33:25.9446709Z T: int, 2025-05-07T20:33:25.9446911Z D: int, 2025-05-07T20:33:25.9447137Z scale_ub: Optional[float], 2025-05-07T20:33:25.9447406Z contiguous: bool, 2025-05-07T20:33:25.9447655Z compiled: bool, 2025-05-07T20:33:25.9447944Z ) -> None: 2025-05-07T20:33:25.9448162Z torch.manual_seed(2025) 2025-05-07T20:33:25.9448412Z 2025-05-07T20:33:25.9448691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.9449084Z 2025-05-07T20:33:25.9449276Z x_sign = torch.sign(x) 2025-05-07T20:33:25.9449578Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.9449890Z x = x_sign * x_clamp 2025-05-07T20:33:25.9450134Z x0 = x[:, :D] 2025-05-07T20:33:25.9450361Z x1 = x[:, D:] 2025-05-07T20:33:25.9450576Z 2025-05-07T20:33:25.9450761Z if contiguous: 2025-05-07T20:33:25.9451001Z x0 = x0.contiguous() 2025-05-07T20:33:25.9451268Z x1 = x1.contiguous() 2025-05-07T20:33:25.9451507Z 2025-05-07T20:33:25.9451705Z if scale_ub is not None: 2025-05-07T20:33:25.9451981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.9452313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.9452629Z ) 2025-05-07T20:33:25.9452825Z else: 2025-05-07T20:33:25.9453033Z scale_ub_tensor = None 2025-05-07T20:33:25.9453288Z 2025-05-07T20:33:25.9453576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.9453887Z op = silu_mul_quant 2025-05-07T20:33:25.9454143Z if compiled: 2025-05-07T20:33:25.9454394Z op = torch.compile(op) 2025-05-07T20:33:25.9454690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9454959Z 2025-05-07T20:33:25.9455155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:25.9455318Z 2025-05-07T20:33:25.9455425Z moe/activation_test.py:117: 2025-05-07T20:33:25.9455716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9456056Z moe/activation_test.py:115: in fn 2025-05-07T20:33:25.9456345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.9456897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:25.9457456Z return fn(*args, **kwargs) 
2025-05-07T20:33:25.9458112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:25.9458801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:25.9459328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.9460010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.9460673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.9461203Z kernel = self.compile( 2025-05-07T20:33:25.9461784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.9462445Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.9462849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.9463078Z 2025-05-07T20:33:25.9463288Z self = 2025-05-07T20:33:25.9464371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.9465745Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299afc0>} 2025-05-07T20:33:25.9467093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.9468160Z context = 2025-05-07T20:33:25.9468452Z 2025-05-07T20:33:25.9468619Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.9469151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.9469662Z module_map=module_map) 2025-05-07T20:33:25.9470029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.9470384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:25.9470652Z E ^ 2025-05-07T20:33:25.9471121Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.9471571Z 2025-05-07T20:33:25.9471987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.9472511Z 2025-05-07T20:33:25.9472617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.9473037Z self=, 2025-05-07T20:33:25.9473440Z T=2048, 2025-05-07T20:33:25.9473629Z D=7168, 2025-05-07T20:33:25.9473930Z scale_ub=1200.0, 2025-05-07T20:33:25.9474161Z contiguous=False, 2025-05-07T20:33:25.9474385Z compiled=False, 2025-05-07T20:33:26.1436136Z ) 2025-05-07T20:33:26.1437149Z self = 2025-05-07T20:33:26.1438115Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:26.1438563Z 2025-05-07T20:33:26.1438682Z @given( 2025-05-07T20:33:26.1439024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1439455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1439847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1440277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1440588Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1440866Z ) 2025-05-07T20:33:26.1441205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1441643Z def test_silu_mul_quant( 2025-05-07T20:33:26.1441871Z self, 2025-05-07T20:33:26.1442058Z T: int, 2025-05-07T20:33:26.1442248Z D: int, 2025-05-07T20:33:26.1442451Z scale_ub: Optional[float], 2025-05-07T20:33:26.1442711Z contiguous: bool, 2025-05-07T20:33:26.1442938Z compiled: bool, 2025-05-07T20:33:26.1443150Z ) -> None: 2025-05-07T20:33:26.1443354Z torch.manual_seed(2025) 2025-05-07T20:33:26.1443591Z 2025-05-07T20:33:26.1443860Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1444194Z 2025-05-07T20:33:26.1444499Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1444906Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1445211Z x = x_sign * x_clamp 2025-05-07T20:33:26.1445442Z x0 = x[:, :D] 2025-05-07T20:33:26.1445640Z x1 = x[:, D:] 2025-05-07T20:33:26.1445839Z 2025-05-07T20:33:26.1446012Z if contiguous: 2025-05-07T20:33:26.1446229Z x0 = x0.contiguous() 2025-05-07T20:33:26.1446483Z x1 = x1.contiguous() 2025-05-07T20:33:26.1446714Z 2025-05-07T20:33:26.1446892Z if scale_ub is not None: 2025-05-07T20:33:26.1447149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1447473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1447769Z ) 2025-05-07T20:33:26.1447945Z else: 2025-05-07T20:33:26.1448144Z scale_ub_tensor = None 2025-05-07T20:33:26.1448386Z 2025-05-07T20:33:26.1448608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1448905Z op = silu_mul_quant 2025-05-07T20:33:26.1449152Z if compiled: 2025-05-07T20:33:26.1449390Z op = torch.compile(op) 2025-05-07T20:33:26.1449739Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1450007Z 2025-05-07T20:33:26.1450196Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1450355Z 2025-05-07T20:33:26.1450453Z moe/activation_test.py:117: 2025-05-07T20:33:26.1450830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1451153Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1451433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1452108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:26.1452789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1453320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1453993Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1454648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1455175Z kernel = self.compile( 2025-05-07T20:33:26.1455709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1456417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1456806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1457028Z 2025-05-07T20:33:26.1457234Z self = 2025-05-07T20:33:26.1458302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1459733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299bec0>} 2025-05-07T20:33:26.1461051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1462068Z context = 2025-05-07T20:33:26.1462358Z 2025-05-07T20:33:26.1462515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1463029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1463478Z module_map=module_map) 2025-05-07T20:33:26.1463831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1464175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1464470Z E ^ 2025-05-07T20:33:26.1464931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1465382Z 2025-05-07T20:33:26.1465795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.1466307Z 2025-05-07T20:33:26.1466411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.1466808Z self=, 2025-05-07T20:33:26.1467202Z T=1, 2025-05-07T20:33:26.1467374Z D=7168, 2025-05-07T20:33:26.1467556Z scale_ub=None, 2025-05-07T20:33:26.1467761Z contiguous=True, 2025-05-07T20:33:26.1467978Z compiled=False, 2025-05-07T20:33:26.1468167Z ) 2025-05-07T20:33:26.1468477Z self = 2025-05-07T20:33:26.1468969Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:26.1469231Z 2025-05-07T20:33:26.1469314Z @given( 2025-05-07T20:33:26.1469582Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1469897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1470210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1470538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1470910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1471187Z ) 2025-05-07T20:33:26.1471521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1471948Z def test_silu_mul_quant( 2025-05-07T20:33:26.1472186Z self, 2025-05-07T20:33:26.1472377Z T: int, 2025-05-07T20:33:26.1472559Z D: int, 2025-05-07T20:33:26.1472780Z scale_ub: Optional[float], 2025-05-07T20:33:26.1473050Z contiguous: bool, 2025-05-07T20:33:26.1473278Z compiled: bool, 2025-05-07T20:33:26.1473493Z ) -> None: 2025-05-07T20:33:26.1473712Z torch.manual_seed(2025) 2025-05-07T20:33:26.1473945Z 2025-05-07T20:33:26.1474222Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1474564Z 2025-05-07T20:33:26.1474746Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1475089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1475394Z x = x_sign * x_clamp 2025-05-07T20:33:26.1475614Z x0 = x[:, :D] 2025-05-07T20:33:26.1475817Z x1 = x[:, D:] 2025-05-07T20:33:26.1476015Z 2025-05-07T20:33:26.1476181Z if contiguous: 2025-05-07T20:33:26.1476402Z x0 = x0.contiguous() 2025-05-07T20:33:26.1476651Z x1 = x1.contiguous() 2025-05-07T20:33:26.1476871Z 2025-05-07T20:33:26.1477050Z if scale_ub is not None: 2025-05-07T20:33:26.1477310Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1477635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1477921Z ) 2025-05-07T20:33:26.1478105Z else: 2025-05-07T20:33:26.1478310Z scale_ub_tensor = None 2025-05-07T20:33:26.1478539Z 2025-05-07T20:33:26.1478758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1479061Z op = silu_mul_quant 2025-05-07T20:33:26.1479302Z if compiled: 2025-05-07T20:33:26.1479538Z op = torch.compile(op) 2025-05-07T20:33:26.1479822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1480073Z 2025-05-07T20:33:26.1480253Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1480413Z 2025-05-07T20:33:26.1480513Z moe/activation_test.py:117: 2025-05-07T20:33:26.1480791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1481113Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1481384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1482105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.1482775Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1483298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1483969Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1484729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1485243Z kernel = self.compile( 2025-05-07T20:33:26.1485823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1486467Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1486851Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1487081Z 2025-05-07T20:33:26.1487284Z self = 2025-05-07T20:33:26.1488393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1489749Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ccc0>} 2025-05-07T20:33:26.1491166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1492166Z context = 2025-05-07T20:33:26.1492452Z 2025-05-07T20:33:26.1492612Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1493123Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1493583Z module_map=module_map) 2025-05-07T20:33:26.1493931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1494279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1494566Z E ^ 2025-05-07T20:33:26.1495011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1495458Z 2025-05-07T20:33:26.1495866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.1496388Z 2025-05-07T20:33:26.1496483Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.1496886Z self=, 2025-05-07T20:33:26.1497265Z T=16384, 2025-05-07T20:33:26.1497446Z D=7168, 2025-05-07T20:33:26.1497628Z scale_ub=1200.0, 2025-05-07T20:33:26.1497838Z contiguous=False, 2025-05-07T20:33:26.1498051Z compiled=True, 2025-05-07T20:33:26.1498244Z ) 2025-05-07T20:33:26.1498545Z self = 2025-05-07T20:33:26.1499033Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:26.1499309Z 2025-05-07T20:33:26.1499391Z @given( 2025-05-07T20:33:26.1499635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.1499934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.1500234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.1500554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.1500877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.1501142Z ) 2025-05-07T20:33:26.1501483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.1501912Z def test_silu_mul_quant( 2025-05-07T20:33:26.1502180Z self, 2025-05-07T20:33:26.1502365Z T: int, 2025-05-07T20:33:26.1502559Z D: int, 2025-05-07T20:33:26.1502761Z scale_ub: Optional[float], 2025-05-07T20:33:26.1503021Z contiguous: bool, 2025-05-07T20:33:26.1503249Z compiled: bool, 2025-05-07T20:33:26.1503456Z ) -> None: 2025-05-07T20:33:26.1503672Z torch.manual_seed(2025) 2025-05-07T20:33:26.1503900Z 2025-05-07T20:33:26.1504155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.1504483Z 2025-05-07T20:33:26.1504667Z x_sign = torch.sign(x) 2025-05-07T20:33:26.1504953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.1505245Z x = x_sign * x_clamp 2025-05-07T20:33:26.1505475Z x0 = x[:, :D] 2025-05-07T20:33:26.1505681Z x1 = x[:, D:] 2025-05-07T20:33:26.1505871Z 2025-05-07T20:33:26.1506045Z if contiguous: 2025-05-07T20:33:26.1506263Z x0 = x0.contiguous() 2025-05-07T20:33:26.1506502Z x1 = x1.contiguous() 2025-05-07T20:33:26.1506725Z 2025-05-07T20:33:26.1506954Z if scale_ub is not None: 2025-05-07T20:33:26.1507208Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.1507527Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.1507864Z ) 2025-05-07T20:33:26.1508035Z else: 2025-05-07T20:33:26.1508446Z scale_ub_tensor = None 2025-05-07T20:33:26.1508729Z 2025-05-07T20:33:26.1508937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.1509239Z op = silu_mul_quant 2025-05-07T20:33:26.1509475Z if compiled: 2025-05-07T20:33:26.1509702Z op = torch.compile(op) 2025-05-07T20:33:26.1509983Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1510238Z 2025-05-07T20:33:26.1510412Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.1510569Z 2025-05-07T20:33:26.1510659Z moe/activation_test.py:117: 2025-05-07T20:33:26.1510943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1511265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.1511527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.1512073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.1513318Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.1513966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.1514639Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.1515164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.1515834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.1516486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.1517001Z kernel = self.compile( 2025-05-07T20:33:26.1517537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.1518177Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.1518603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.1518842Z 2025-05-07T20:33:26.1519041Z self = 2025-05-07T20:33:26.1520117Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.1521686Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379e0c0>} 2025-05-07T20:33:26.1523023Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.1524038Z context = 2025-05-07T20:33:26.1524469Z 2025-05-07T20:33:26.1524633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.1525146Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.1525602Z module_map=module_map) 2025-05-07T20:33:26.1525968Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.1526320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.1526579Z E ^ 2025-05-07T20:33:26.1527041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.1533893Z 2025-05-07T20:33:26.1534467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2867649Z 2025-05-07T20:33:26.2867829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2868434Z self=, 2025-05-07T20:33:26.2869369Z T=1, 2025-05-07T20:33:26.2869673Z D=7168, 2025-05-07T20:33:26.2869990Z scale_ub=None, 2025-05-07T20:33:26.2870298Z contiguous=False, 2025-05-07T20:33:26.2870522Z compiled=False, 2025-05-07T20:33:26.2870724Z ) 2025-05-07T20:33:26.2871032Z self = 2025-05-07T20:33:26.2871513Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:26.2871767Z 2025-05-07T20:33:26.2871845Z @given( 2025-05-07T20:33:26.2872064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.2872369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.2872669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.2872992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.2873303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.2873665Z ) 2025-05-07T20:33:26.2874007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.2874438Z def test_silu_mul_quant( 2025-05-07T20:33:26.2874672Z self, 2025-05-07T20:33:26.2874856Z T: int, 2025-05-07T20:33:26.2875037Z D: int, 2025-05-07T20:33:26.2875253Z scale_ub: Optional[float], 2025-05-07T20:33:26.2875516Z contiguous: bool, 2025-05-07T20:33:26.2875741Z compiled: bool, 2025-05-07T20:33:26.2875956Z ) -> None: 2025-05-07T20:33:26.2876166Z torch.manual_seed(2025) 2025-05-07T20:33:26.2876399Z 2025-05-07T20:33:26.2876663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.2877001Z 2025-05-07T20:33:26.2877180Z x_sign = torch.sign(x) 2025-05-07T20:33:26.2877462Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.2877764Z x = x_sign * x_clamp 2025-05-07T20:33:26.2877994Z x0 = x[:, :D] 2025-05-07T20:33:26.2878202Z x1 = x[:, D:] 2025-05-07T20:33:26.2878406Z 2025-05-07T20:33:26.2878585Z if contiguous: 2025-05-07T20:33:26.2878812Z x0 = x0.contiguous() 2025-05-07T20:33:26.2879074Z x1 = x1.contiguous() 2025-05-07T20:33:26.2879312Z 2025-05-07T20:33:26.2879488Z if scale_ub is not None: 2025-05-07T20:33:26.2879759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.2880081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.2880371Z ) 2025-05-07T20:33:26.2880562Z else: 2025-05-07T20:33:26.2880761Z scale_ub_tensor = None 2025-05-07T20:33:26.2881069Z 2025-05-07T20:33:26.2881295Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.2881601Z op = silu_mul_quant 2025-05-07T20:33:26.2881838Z if compiled: 2025-05-07T20:33:26.2882083Z op = torch.compile(op) 2025-05-07T20:33:26.2882380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2882646Z 2025-05-07T20:33:26.2882824Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.2882987Z 2025-05-07T20:33:26.2883078Z moe/activation_test.py:117: 2025-05-07T20:33:26.2883362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2883683Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.2883953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2884770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.2885461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.2886061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.2886734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.2887379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.2887935Z kernel = self.compile( 2025-05-07T20:33:26.2888470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.2889105Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.2889483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2889712Z 2025-05-07T20:33:26.2889910Z self = 2025-05-07T20:33:26.2890982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.2892340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ec00>} 2025-05-07T20:33:26.2893704Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.2894710Z context = 2025-05-07T20:33:26.2895001Z 2025-05-07T20:33:26.2895162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.2895671Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.2896119Z module_map=module_map) 2025-05-07T20:33:26.2896480Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.2896823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.2897060Z E ^ 2025-05-07T20:33:26.2897504Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.2897954Z 2025-05-07T20:33:26.2898360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2898859Z 2025-05-07T20:33:26.2898959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2899348Z self=, 2025-05-07T20:33:26.2899732Z T=2048, 2025-05-07T20:33:26.2899906Z D=7168, 2025-05-07T20:33:26.2900076Z scale_ub=None, 2025-05-07T20:33:26.2900278Z contiguous=False, 2025-05-07T20:33:26.2900489Z compiled=True, 2025-05-07T20:33:26.2900673Z ) 2025-05-07T20:33:26.2901017Z self = 2025-05-07T20:33:26.2901502Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.2901763Z 2025-05-07T20:33:26.2901835Z @given( 2025-05-07T20:33:26.2902047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.2902348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.2902647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.2902957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.2903272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.2903542Z ) 2025-05-07T20:33:26.2903872Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.2904288Z def test_silu_mul_quant( 2025-05-07T20:33:26.2904511Z self, 2025-05-07T20:33:26.2904696Z T: int, 2025-05-07T20:33:26.2904872Z D: int, 2025-05-07T20:33:26.2905082Z scale_ub: Optional[float], 2025-05-07T20:33:26.2905341Z contiguous: bool, 2025-05-07T20:33:26.2905609Z compiled: bool, 2025-05-07T20:33:26.2905821Z ) -> None: 2025-05-07T20:33:26.2906025Z torch.manual_seed(2025) 2025-05-07T20:33:26.2906246Z 2025-05-07T20:33:26.2906502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.2906868Z 2025-05-07T20:33:26.2907039Z x_sign = torch.sign(x) 2025-05-07T20:33:26.2907316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.2907609Z x = x_sign * x_clamp 2025-05-07T20:33:26.2907827Z x0 = x[:, :D] 2025-05-07T20:33:26.2908029Z x1 = x[:, D:] 2025-05-07T20:33:26.2908468Z 2025-05-07T20:33:26.2908702Z if contiguous: 2025-05-07T20:33:26.2908922Z x0 = x0.contiguous() 2025-05-07T20:33:26.2909169Z x1 = x1.contiguous() 2025-05-07T20:33:26.2909390Z 2025-05-07T20:33:26.2909568Z if scale_ub is not None: 2025-05-07T20:33:26.2909833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.2910159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.2910448Z ) 2025-05-07T20:33:26.2910629Z else: 2025-05-07T20:33:26.2910922Z scale_ub_tensor = None 2025-05-07T20:33:26.2911151Z 2025-05-07T20:33:26.2911361Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.2911661Z op = silu_mul_quant 2025-05-07T20:33:26.2911890Z if compiled: 2025-05-07T20:33:26.2912124Z op = torch.compile(op) 2025-05-07T20:33:26.2912403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2912656Z 2025-05-07T20:33:26.2912831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.2912988Z 2025-05-07T20:33:26.2913085Z moe/activation_test.py:117: 2025-05-07T20:33:26.2913365Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2913679Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.2913945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.2914489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.2915022Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.2915668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:26.2916337Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:26.2916857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:26.2917513Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:26.2918158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:26.2918669Z kernel = self.compile( 2025-05-07T20:33:26.2919258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:26.2919901Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:26.2920285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.2920510Z 2025-05-07T20:33:26.2920715Z self = 2025-05-07T20:33:26.2921766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:26.2923117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92cbc2c0>} 2025-05-07T20:33:26.2924621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:26.2925624Z context = 2025-05-07T20:33:26.2927388Z 2025-05-07T20:33:26.2927559Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:26.2928118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:26.2928569Z module_map=module_map) 2025-05-07T20:33:26.2928926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:26.2929258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:26.2929499Z E ^ 2025-05-07T20:33:26.2929949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:26.2930387Z 2025-05-07T20:33:26.2930803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:26.2931300Z 2025-05-07T20:33:26.2931397Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:26.2931796Z self=, 2025-05-07T20:33:26.2932254Z T=4096, 2025-05-07T20:33:26.2932424Z D=7168, 2025-05-07T20:33:26.2932600Z scale_ub=None, 2025-05-07T20:33:26.2932802Z contiguous=False, 2025-05-07T20:33:26.2933012Z compiled=True, 2025-05-07T20:33:26.7741900Z ) 2025-05-07T20:33:26.7742460Z self = 2025-05-07T20:33:26.7743197Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:26.7743593Z 2025-05-07T20:33:26.7743713Z @given( 2025-05-07T20:33:26.7744022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:26.7744469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:26.7744793Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:26.7745107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:26.7745433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:26.7745706Z ) 2025-05-07T20:33:26.7746042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:26.7746475Z def test_silu_mul_quant( 2025-05-07T20:33:26.7746708Z self, 2025-05-07T20:33:26.7746892Z T: int, 2025-05-07T20:33:26.7747072Z D: int, 2025-05-07T20:33:26.7747279Z scale_ub: Optional[float], 2025-05-07T20:33:26.7747544Z contiguous: bool, 2025-05-07T20:33:26.7747765Z compiled: bool, 2025-05-07T20:33:26.7747979Z ) -> None: 2025-05-07T20:33:26.7748183Z torch.manual_seed(2025) 2025-05-07T20:33:26.7748407Z 2025-05-07T20:33:26.7748683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:26.7749008Z 2025-05-07T20:33:26.7749185Z x_sign = torch.sign(x) 2025-05-07T20:33:26.7749594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:26.7749906Z x = x_sign * x_clamp 2025-05-07T20:33:26.7750129Z x0 = x[:, :D] 2025-05-07T20:33:26.7750334Z x1 = x[:, D:] 2025-05-07T20:33:26.7750525Z 2025-05-07T20:33:26.7750687Z if contiguous: 2025-05-07T20:33:26.7750909Z x0 = x0.contiguous() 2025-05-07T20:33:26.7751157Z x1 = x1.contiguous() 2025-05-07T20:33:26.7751390Z 2025-05-07T20:33:26.7751566Z if scale_ub is not None: 2025-05-07T20:33:26.7751823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:26.7752146Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:26.7752432Z ) 2025-05-07T20:33:26.7752615Z else: 2025-05-07T20:33:26.7752816Z scale_ub_tensor = None 2025-05-07T20:33:26.7753043Z 2025-05-07T20:33:26.7753260Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:26.7753630Z op = silu_mul_quant 2025-05-07T20:33:26.7753922Z if compiled: 2025-05-07T20:33:26.7754385Z op = torch.compile(op) 2025-05-07T20:33:26.7754673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7754935Z 2025-05-07T20:33:26.7755124Z > y_fp8, y_scale = fn() 2025-05-07T20:33:26.7755352Z 2025-05-07T20:33:26.7755444Z moe/activation_test.py:117: 2025-05-07T20:33:26.7755737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:26.7756063Z moe/activation_test.py:115: in fn 2025-05-07T20:33:26.7756332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:26.7756880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:26.7757422Z return fn(*args, **kwargs) 
2025-05-07T20:33:26.7758071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:26.7758740Z     _fbgemm_silu_mul_quant[grid](
The Triton frames (jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir) repeat verbatim and end in the same error:
2025-05-07T20:33:26.7769916Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:26.7770252Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:26.7770494Z E       ^
2025-05-07T20:33:26.7770939Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:26.7771388Z 
2025-05-07T20:33:26.7771797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
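The failure does not need the Hypothesis or pytest harness; a single direct call reproduces it. A minimal repro sketch, assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module named in the traceback (the import path is an assumption, not confirmed by this log) and that the host GPU has compute capability below 8.9:

# Repro sketch only: import path taken from the traceback above; it is an
# assumption that the symbol is importable this way.
import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# Same call shape as the test (explicit None for the scale upper bound).
# On SM < 8.9, compiling the _fbgemm_silu_mul_quant Triton kernel raises the
# ValueError shown above.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)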
Hypothesis then tries the remaining sampled combinations; each one prints the same test source and fails with the identical traceback and CompilationError:
2025-05-07T20:33:26.7772398Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:26.7801742Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:26.9404806Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:26.9436891Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:27.1079692Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:27.1112483Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:27.2850680Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:27.2882944Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:27.4074164Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:27.4105614Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
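Every sampled combination, eager and torch.compile alike, fails identically, which points at the GPU architecture rather than the inputs: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn) and is only supported on NVIDIA compute capability 8.9 and newer (Ada, Hopper), while pre-8.9 parts such as the SM 8.6 A10G expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability gate that could skip these tests on unsupported hardware; supports_fp8e4nv is a hypothetical helper, not FBGEMM API:

# Sketch only: gate FP8 Triton tests on device capability. The helper name and
# its placement are illustrative assumptions, not code from the test file.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers torch.float8_e4m3fn to fp8e4nv, which requires SM 8.9+;
    # older GPUs only offer fp8e4b15 / fp8e5, matching the error in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 (fp8e4nv) requires SM 8.9+")
class ActivationTests(unittest.TestCase):
    ...  # test_silu_mul_quant would then be skipped on SM 8.6 runners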
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.4134511Z 2025-05-07T20:33:27.4134924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[The next three examples fail with the same test body and the same CompilationError traceback as above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant. Only the parameters differ:]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
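[Note: the repeated CompilationError is a hardware-capability mismatch rather than a problem with the example parameters. Triton's fp8e4nv corresponds to float8_e4m3fn, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada) onward; the A10G in a linux.g5.4xlarge runner is SM 8.6, so only fp8e4b15 and fp8e5 are offered, exactly as the error states. A minimal sketch of a guard that would skip these examples on unsupported hardware follows; the helper name and decorator placement are illustrative, not taken from the test suite:]

import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability 8.9+ (Ada/Hopper);
    # the A10G on this runner reports (8, 6), matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical use on the failing test:
#   @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
#   def test_silu_mul_quant(self, ...) -> None: ...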
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same test body as above)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
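[Note: the failed 320.00 MiB request matches exactly one [T, 2*D] bfloat16 temporary; torch.abs, torch.clamp, and the sign/multiply steps in the test each materialize a fresh tensor of that shape. A quick check of the arithmetic for this example:]

# One [T, 2*D] bfloat16 temporary at 2 bytes per element.
T, D = 16384, 5120
print(T * (2 * D) * 2 / 2**20)  # 320.0 -- matches "Tried to allocate 320.00 MiB"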
[The remaining examples repeat the same two failure modes; the full test body and tracebacks are identical to those shown above, so only the parameters, failing line, and outcome are listed:]
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError as above
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 80.00 MiB)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same test body as above)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.0536646Z 2025-05-07T20:33:28.0536760Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.0536961Z 2025-05-07T20:33:28.0537055Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0537455Z self=, 2025-05-07T20:33:28.0537837Z T=4096, 2025-05-07T20:33:28.0538007Z D=7168, 2025-05-07T20:33:28.0538174Z scale_ub=None, 2025-05-07T20:33:28.0538372Z contiguous=True, 2025-05-07T20:33:28.0538576Z compiled=True, 2025-05-07T20:33:28.0538755Z ) 2025-05-07T20:33:28.0539105Z self = 2025-05-07T20:33:28.0539577Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.0539832Z 2025-05-07T20:33:28.0539942Z @given( 2025-05-07T20:33:28.0540156Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.0540454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.0540741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.0541050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.0541364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.0541630Z ) 2025-05-07T20:33:28.0541954Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.0542377Z def test_silu_mul_quant( 2025-05-07T20:33:28.0542605Z self, 2025-05-07T20:33:28.0542777Z T: int, 2025-05-07T20:33:28.0542952Z D: int, 2025-05-07T20:33:28.0543152Z scale_ub: Optional[float], 2025-05-07T20:33:28.0543409Z contiguous: bool, 2025-05-07T20:33:28.0543649Z compiled: bool, 2025-05-07T20:33:28.0543857Z ) -> None: 2025-05-07T20:33:28.0544054Z torch.manual_seed(2025) 2025-05-07T20:33:28.0544281Z 2025-05-07T20:33:28.0544540Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.0546561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.0548456Z 2025-05-07T20:33:28.0548573Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.0548778Z 2025-05-07T20:33:28.0548880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.0549283Z self=, 2025-05-07T20:33:28.0549668Z T=2048, 2025-05-07T20:33:28.0549835Z D=5120, 2025-05-07T20:33:28.0550014Z scale_ub=1200.0, 2025-05-07T20:33:28.0550227Z contiguous=False, 2025-05-07T20:33:28.0550431Z compiled=False, 2025-05-07T20:33:28.1090876Z ) 2025-05-07T20:33:28.1091536Z self = 2025-05-07T20:33:28.1092506Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.1093037Z 2025-05-07T20:33:28.1093175Z @given( 2025-05-07T20:33:28.1093611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1094204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1094777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1095473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1096295Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1096842Z ) 2025-05-07T20:33:28.1097509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1098344Z def test_silu_mul_quant( 2025-05-07T20:33:28.1098802Z self, 2025-05-07T20:33:28.1099185Z T: int, 2025-05-07T20:33:28.1099466Z D: int, 2025-05-07T20:33:28.1099710Z scale_ub: Optional[float], 2025-05-07T20:33:28.1099997Z contiguous: bool, 2025-05-07T20:33:28.1100223Z compiled: bool, 2025-05-07T20:33:28.1100439Z ) -> None: 2025-05-07T20:33:28.1100649Z torch.manual_seed(2025) 2025-05-07T20:33:28.1100874Z 2025-05-07T20:33:28.1101139Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1103320Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1105318Z 2025-05-07T20:33:28.1105431Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1105639Z 2025-05-07T20:33:28.1105742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1106140Z self=, 2025-05-07T20:33:28.1106538Z T=4096, 2025-05-07T20:33:28.1106715Z D=7168, 2025-05-07T20:33:28.1106891Z scale_ub=1200.0, 2025-05-07T20:33:28.1107105Z contiguous=True, 2025-05-07T20:33:28.1107326Z compiled=False, 2025-05-07T20:33:28.1107515Z ) 2025-05-07T20:33:28.1107832Z self = 2025-05-07T20:33:28.1108540Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.1108812Z 2025-05-07T20:33:28.1108897Z @given( 2025-05-07T20:33:28.1109109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1109492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1109785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1110106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1110435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1110708Z ) 2025-05-07T20:33:28.1111049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1111485Z def test_silu_mul_quant( 2025-05-07T20:33:28.1111715Z self, 2025-05-07T20:33:28.1111900Z T: int, 2025-05-07T20:33:28.1112095Z D: int, 2025-05-07T20:33:28.1112301Z scale_ub: Optional[float], 2025-05-07T20:33:28.1112560Z contiguous: bool, 2025-05-07T20:33:28.1112789Z compiled: bool, 2025-05-07T20:33:28.1113008Z ) -> None: 2025-05-07T20:33:28.1113218Z torch.manual_seed(2025) 2025-05-07T20:33:28.1113447Z 2025-05-07T20:33:28.1113708Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1115789Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1117631Z 2025-05-07T20:33:28.1117816Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1118024Z 2025-05-07T20:33:28.1118117Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1118515Z self=, 2025-05-07T20:33:28.1118906Z T=16384, 2025-05-07T20:33:28.1119089Z D=7168, 2025-05-07T20:33:28.1119260Z scale_ub=None, 2025-05-07T20:33:28.1119460Z contiguous=False, 2025-05-07T20:33:28.1119670Z compiled=True, 2025-05-07T20:33:28.1119855Z ) 2025-05-07T20:33:28.1120155Z self = 2025-05-07T20:33:28.1120635Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.1120906Z 2025-05-07T20:33:28.1120973Z @given( 2025-05-07T20:33:28.1121190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1121491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1121853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1122168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1122537Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1122803Z ) 2025-05-07T20:33:28.1123131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1123559Z def test_silu_mul_quant( 2025-05-07T20:33:28.1123791Z self, 2025-05-07T20:33:28.1123962Z T: int, 2025-05-07T20:33:28.1124145Z D: int, 2025-05-07T20:33:28.1124458Z scale_ub: Optional[float], 2025-05-07T20:33:28.1124711Z contiguous: bool, 2025-05-07T20:33:28.1124936Z compiled: bool, 2025-05-07T20:33:28.1125140Z ) -> None: 2025-05-07T20:33:28.1125334Z torch.manual_seed(2025) 2025-05-07T20:33:28.1125561Z 2025-05-07T20:33:28.1125816Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1127850Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1129795Z 2025-05-07T20:33:28.1129910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1130114Z 2025-05-07T20:33:28.1130209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1130608Z self=, 2025-05-07T20:33:28.1130993Z T=4096, 2025-05-07T20:33:28.1131157Z D=7168, 2025-05-07T20:33:28.1131340Z scale_ub=None, 2025-05-07T20:33:28.1131539Z contiguous=True, 2025-05-07T20:33:28.1131742Z compiled=False, 2025-05-07T20:33:28.1131932Z ) 2025-05-07T20:33:28.1132241Z self = 2025-05-07T20:33:28.1132714Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.1132980Z 2025-05-07T20:33:28.1133048Z @given( 2025-05-07T20:33:28.1133266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1133560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1133845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1134162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1134474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1134736Z ) 2025-05-07T20:33:28.1135069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1135489Z def test_silu_mul_quant( 2025-05-07T20:33:28.1135713Z self, 2025-05-07T20:33:28.1135887Z T: int, 2025-05-07T20:33:28.1136072Z D: int, 2025-05-07T20:33:28.1136315Z scale_ub: Optional[float], 2025-05-07T20:33:28.1136575Z contiguous: bool, 2025-05-07T20:33:28.1136799Z compiled: bool, 2025-05-07T20:33:28.1137003Z ) -> None: 2025-05-07T20:33:28.1137198Z torch.manual_seed(2025) 2025-05-07T20:33:28.1137429Z 2025-05-07T20:33:28.1137683Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1139694Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1141594Z 2025-05-07T20:33:28.1141703Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1141952Z 2025-05-07T20:33:28.1142048Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1142443Z self=, 2025-05-07T20:33:28.1142833Z T=16384, 2025-05-07T20:33:28.1143010Z D=7168, 2025-05-07T20:33:28.1143185Z scale_ub=None, 2025-05-07T20:33:28.1143384Z contiguous=True, 2025-05-07T20:33:28.1143587Z compiled=False, 2025-05-07T20:33:28.1143776Z ) 2025-05-07T20:33:28.1144080Z self = 2025-05-07T20:33:28.1144555Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.1144829Z 2025-05-07T20:33:28.1144897Z @given( 2025-05-07T20:33:28.1145110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1145405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1145698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1146013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1146364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1146678Z ) 2025-05-07T20:33:28.1147011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1147483Z def test_silu_mul_quant( 2025-05-07T20:33:28.1147706Z self, 2025-05-07T20:33:28.1147886Z T: int, 2025-05-07T20:33:28.1148069Z D: int, 2025-05-07T20:33:28.1148269Z scale_ub: Optional[float], 2025-05-07T20:33:28.1148521Z contiguous: bool, 2025-05-07T20:33:28.1148746Z compiled: bool, 2025-05-07T20:33:28.1148951Z ) -> None: 2025-05-07T20:33:28.1149150Z torch.manual_seed(2025) 2025-05-07T20:33:28.1149376Z 2025-05-07T20:33:28.1149627Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1151693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1153580Z 2025-05-07T20:33:28.1153691Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.1153900Z 2025-05-07T20:33:28.1153994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.1154395Z self=, 2025-05-07T20:33:28.1154780Z T=16384, 2025-05-07T20:33:28.1154961Z D=7168, 2025-05-07T20:33:28.1155137Z scale_ub=1200.0, 2025-05-07T20:33:28.1155391Z contiguous=True, 2025-05-07T20:33:28.1155602Z compiled=False, 2025-05-07T20:33:28.1155792Z ) 2025-05-07T20:33:28.1156094Z self = 2025-05-07T20:33:28.1156577Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.1156859Z 2025-05-07T20:33:28.1156928Z @given( 2025-05-07T20:33:28.1157143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.1157434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.1157726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.1158038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.1158347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.1158619Z ) 2025-05-07T20:33:28.1158952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.1159423Z def test_silu_mul_quant( 2025-05-07T20:33:28.1159651Z self, 2025-05-07T20:33:28.1159838Z T: int, 2025-05-07T20:33:28.1160016Z D: int, 2025-05-07T20:33:28.1160265Z scale_ub: Optional[float], 2025-05-07T20:33:28.1160530Z contiguous: bool, 2025-05-07T20:33:28.1160758Z compiled: bool, 2025-05-07T20:33:28.1160961Z ) -> None: 2025-05-07T20:33:28.1161168Z torch.manual_seed(2025) 2025-05-07T20:33:28.1161392Z 2025-05-07T20:33:28.1161647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.1163682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.1165661Z 2025-05-07T20:33:28.1165775Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.2958596Z 2025-05-07T20:33:28.2958760Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.2959300Z self=, 2025-05-07T20:33:28.2959746Z T=128, 2025-05-07T20:33:28.2959928Z D=5120, 2025-05-07T20:33:28.2960120Z scale_ub=1200.0, 2025-05-07T20:33:28.2960336Z contiguous=False, 2025-05-07T20:33:28.2960553Z compiled=False, 2025-05-07T20:33:28.2960770Z ) 2025-05-07T20:33:28.2961415Z self = 2025-05-07T20:33:28.2962399Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.2962930Z 2025-05-07T20:33:28.2963070Z @given( 2025-05-07T20:33:28.2963484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.2964073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.2964802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.2965423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.2966044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.2966571Z ) 2025-05-07T20:33:28.2967227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.2968069Z def test_silu_mul_quant( 2025-05-07T20:33:28.2968501Z self, 2025-05-07T20:33:28.2968860Z T: int, 2025-05-07T20:33:28.2969082Z D: int, 2025-05-07T20:33:28.2969277Z scale_ub: Optional[float], 2025-05-07T20:33:28.2969533Z contiguous: bool, 2025-05-07T20:33:28.2969755Z compiled: bool, 2025-05-07T20:33:28.2969956Z ) -> None: 2025-05-07T20:33:28.2970153Z torch.manual_seed(2025) 2025-05-07T20:33:28.2970382Z 2025-05-07T20:33:28.2977306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.2977657Z 2025-05-07T20:33:28.2977854Z x_sign = torch.sign(x) 2025-05-07T20:33:28.2978151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.2978462Z x = x_sign * x_clamp 2025-05-07T20:33:28.2978706Z x0 = x[:, :D] 2025-05-07T20:33:28.2978926Z x1 = x[:, D:] 2025-05-07T20:33:28.2979139Z 2025-05-07T20:33:28.2979322Z if contiguous: 2025-05-07T20:33:28.2979548Z x0 = x0.contiguous() 2025-05-07T20:33:28.2979809Z x1 = x1.contiguous() 2025-05-07T20:33:28.2980050Z 2025-05-07T20:33:28.2980236Z if scale_ub is not None: 2025-05-07T20:33:28.2980510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.2980846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.2981144Z ) 2025-05-07T20:33:28.2981414Z else: 2025-05-07T20:33:28.2981618Z scale_ub_tensor = None 2025-05-07T20:33:28.2981865Z 2025-05-07T20:33:28.2982104Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.2982479Z op = silu_mul_quant 2025-05-07T20:33:28.2982732Z if compiled: 2025-05-07T20:33:28.2982972Z op = torch.compile(op) 2025-05-07T20:33:28.2983269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.2983537Z 2025-05-07T20:33:28.2983717Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.2983882Z 2025-05-07T20:33:28.2983981Z moe/activation_test.py:117: 2025-05-07T20:33:28.2984273Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.2984598Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.2984882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.2985573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.2986262Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.2986796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.2987477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.2988138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.2988703Z kernel = self.compile( 2025-05-07T20:33:28.2989243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.2989900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.2990302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.2990568Z 2025-05-07T20:33:28.2990774Z self = 2025-05-07T20:33:28.2991855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.2993221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b922251c0>} 2025-05-07T20:33:28.2994550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.2995565Z context = 2025-05-07T20:33:28.2995852Z 2025-05-07T20:33:28.2996015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.2996530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.2996999Z module_map=module_map) 2025-05-07T20:33:28.2997403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.2997753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.2998009Z E ^ 2025-05-07T20:33:28.2998474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.2998923Z 2025-05-07T20:33:28.2999333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.2999898Z 2025-05-07T20:33:28.3000000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.3000405Z self=, 2025-05-07T20:33:28.3000797Z T=2048, 2025-05-07T20:33:28.3000975Z D=7168, 2025-05-07T20:33:28.3001157Z scale_ub=None, 2025-05-07T20:33:28.3001366Z contiguous=False, 2025-05-07T20:33:28.3001664Z compiled=False, 2025-05-07T20:33:28.3001868Z ) 2025-05-07T20:33:28.3002180Z self = 2025-05-07T20:33:28.3002708Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.3002974Z 2025-05-07T20:33:28.3003054Z @given( 2025-05-07T20:33:28.3003279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.3003589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.3003889Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.3004212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.3004650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.3004948Z ) 2025-05-07T20:33:28.3005291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.3005723Z def test_silu_mul_quant( 2025-05-07T20:33:28.3005955Z self, 2025-05-07T20:33:28.3006151Z T: int, 2025-05-07T20:33:28.3006342Z D: int, 2025-05-07T20:33:28.3006552Z scale_ub: Optional[float], 2025-05-07T20:33:28.3006816Z contiguous: bool, 2025-05-07T20:33:28.3007055Z compiled: bool, 2025-05-07T20:33:28.3007272Z ) -> None: 2025-05-07T20:33:28.3007487Z torch.manual_seed(2025) 2025-05-07T20:33:28.3007723Z 2025-05-07T20:33:28.3008035Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.3010315Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
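The interleaved CompilationError ("type fp8e4nv not supported in this architecture", with only fp8e4b15 and fp8e5 accepted) means the Triton backend refuses to lower the e4m3 fp8 dtype on this GPU, which is the expected behavior on devices below compute capability 8.9. A hedged sketch of a skip guard for fp8 tests on such hardware, assuming torch and Triton see the same device:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) generally requires compute capability >= 8.9
        # (Ada/Hopper); older parts only expose the e4b15/e5 encodings.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8KernelTests(unittest.TestCase):
        ...

With a guard like this the whole property-based test is skipped up front instead of failing on every drawn example with the same compiler error.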
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.3012162Z 2025-05-07T20:33:28.3012274Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.3012484Z 2025-05-07T20:33:28.3012583Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.3012977Z self=, 2025-05-07T20:33:28.3013356Z T=128, 2025-05-07T20:33:28.3013528Z D=7168, 2025-05-07T20:33:28.3013702Z scale_ub=1200.0, 2025-05-07T20:33:28.3013904Z contiguous=True, 2025-05-07T20:33:28.3014112Z compiled=True, 2025-05-07T20:33:28.3014295Z ) 2025-05-07T20:33:28.3014591Z self = 2025-05-07T20:33:28.3015062Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.3015325Z 2025-05-07T20:33:28.3015413Z @given( 2025-05-07T20:33:28.3015658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.3015958Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.3016246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.3016635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.3016945Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.3017214Z ) 2025-05-07T20:33:28.3017543Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.3017965Z def test_silu_mul_quant( 2025-05-07T20:33:28.3018189Z self, 2025-05-07T20:33:28.3018364Z T: int, 2025-05-07T20:33:28.3018538Z D: int, 2025-05-07T20:33:28.3018739Z scale_ub: Optional[float], 2025-05-07T20:33:28.3018991Z contiguous: bool, 2025-05-07T20:33:28.3019213Z compiled: bool, 2025-05-07T20:33:28.3019415Z ) -> None: 2025-05-07T20:33:28.3019618Z torch.manual_seed(2025) 2025-05-07T20:33:28.3019842Z 2025-05-07T20:33:28.3020095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.3020487Z 2025-05-07T20:33:28.3020692Z x_sign = torch.sign(x) 2025-05-07T20:33:28.3020990Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.3021344Z x = x_sign * x_clamp 2025-05-07T20:33:28.3021568Z x0 = x[:, :D] 2025-05-07T20:33:28.3021764Z x1 = x[:, D:] 2025-05-07T20:33:28.3021951Z 2025-05-07T20:33:28.3022117Z if contiguous: 2025-05-07T20:33:28.3022328Z x0 = x0.contiguous() 2025-05-07T20:33:28.3022572Z x1 = x1.contiguous() 2025-05-07T20:33:28.3022794Z 2025-05-07T20:33:28.3022964Z if scale_ub is not None: 2025-05-07T20:33:28.3023219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.3023536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.3023830Z ) 2025-05-07T20:33:28.3024001Z else: 2025-05-07T20:33:28.3024200Z scale_ub_tensor = None 2025-05-07T20:33:28.3024433Z 2025-05-07T20:33:28.3024651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.3024951Z op = silu_mul_quant 2025-05-07T20:33:28.3025186Z if compiled: 2025-05-07T20:33:28.3025417Z op = torch.compile(op) 2025-05-07T20:33:28.3025695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.3025952Z 2025-05-07T20:33:28.3026123Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.3026358Z 2025-05-07T20:33:28.3026449Z moe/activation_test.py:117: 2025-05-07T20:33:28.3026728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.3027039Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.3027305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.3027847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.3028389Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.3029027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.3029700Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.3030219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.3030877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.3031526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.3032040Z kernel = self.compile( 2025-05-07T20:33:28.3032568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.3033202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.3033592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.3033812Z 2025-05-07T20:33:28.3034019Z self = 2025-05-07T20:33:28.3035129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.3036477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b920bfb00>} 2025-05-07T20:33:28.3037806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.3038819Z context = 2025-05-07T20:33:28.3039105Z 2025-05-07T20:33:28.3039269Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.3039817Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.3040273Z module_map=module_map) 2025-05-07T20:33:28.3040659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.3040995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.3041236Z E ^ 2025-05-07T20:33:28.3041688Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.3042133Z 2025-05-07T20:33:28.3042552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6013292Z 2025-05-07T20:33:28.6013532Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6013948Z self=, 2025-05-07T20:33:28.6014382Z T=128, 2025-05-07T20:33:28.6014564Z D=7168, 2025-05-07T20:33:28.6014760Z scale_ub=1200.0, 2025-05-07T20:33:28.6014966Z contiguous=True, 2025-05-07T20:33:28.6015186Z compiled=False, 2025-05-07T20:33:28.6015384Z ) 2025-05-07T20:33:28.6015699Z self = 2025-05-07T20:33:28.6016189Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6016457Z 2025-05-07T20:33:28.6016642Z @given( 2025-05-07T20:33:28.6016865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6017166Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6017466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6017788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6018100Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6018380Z ) 2025-05-07T20:33:28.6018728Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6019160Z def test_silu_mul_quant( 2025-05-07T20:33:28.6019435Z self, 2025-05-07T20:33:28.6019621Z T: int, 2025-05-07T20:33:28.6019804Z D: int, 2025-05-07T20:33:28.6020015Z scale_ub: Optional[float], 2025-05-07T20:33:28.6020281Z contiguous: bool, 2025-05-07T20:33:28.6020515Z compiled: bool, 2025-05-07T20:33:28.6020726Z ) -> None: 2025-05-07T20:33:28.6020932Z torch.manual_seed(2025) 2025-05-07T20:33:28.6021166Z 2025-05-07T20:33:28.6021423Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6021751Z 2025-05-07T20:33:28.6021932Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6022208Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6024268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6026110Z 2025-05-07T20:33:28.6026222Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.6026435Z 2025-05-07T20:33:28.6026530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6026927Z self=, 2025-05-07T20:33:28.6027308Z T=128, 2025-05-07T20:33:28.6027480Z D=5120, 2025-05-07T20:33:28.6027655Z scale_ub=1200.0, 2025-05-07T20:33:28.6027859Z contiguous=True, 2025-05-07T20:33:28.6028065Z compiled=True, 2025-05-07T20:33:28.6028254Z ) 2025-05-07T20:33:28.6028551Z self = 2025-05-07T20:33:28.6029095Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.6029353Z 2025-05-07T20:33:28.6029435Z @given( 2025-05-07T20:33:28.6029646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6030008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6030302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6030612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6030923Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6031192Z ) 2025-05-07T20:33:28.6031525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6031946Z def test_silu_mul_quant( 2025-05-07T20:33:28.6032173Z self, 2025-05-07T20:33:28.6032350Z T: int, 2025-05-07T20:33:28.6032526Z D: int, 2025-05-07T20:33:28.6032728Z scale_ub: Optional[float], 2025-05-07T20:33:28.6032986Z contiguous: bool, 2025-05-07T20:33:28.6033214Z compiled: bool, 2025-05-07T20:33:28.6033422Z ) -> None: 2025-05-07T20:33:28.6033622Z torch.manual_seed(2025) 2025-05-07T20:33:28.6033842Z 2025-05-07T20:33:28.6034102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6034429Z 2025-05-07T20:33:28.6034609Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6034876Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6036891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
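Note the free-memory figure shrinking from 26.44 MiB to 4.44 MiB across successive examples: allocations made for earlier examples are apparently still live when the next one is drawn. One hedged cleanup sketch, assuming the test owns all the CUDA tensors it creates, is to release references and the allocator cache between examples, for instance from tearDown:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then hand cached blocks
        # back so the next Hypothesis example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()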
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6038724Z 2025-05-07T20:33:28.6038836Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.6039048Z 2025-05-07T20:33:28.6039159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6039567Z self=, 2025-05-07T20:33:28.6039981Z T=128, 2025-05-07T20:33:28.6040168Z D=7168, 2025-05-07T20:33:28.6040370Z scale_ub=None, 2025-05-07T20:33:28.6040576Z contiguous=True, 2025-05-07T20:33:28.6040804Z compiled=True, 2025-05-07T20:33:28.6041003Z ) 2025-05-07T20:33:28.6041317Z self = 2025-05-07T20:33:28.6041909Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6042273Z 2025-05-07T20:33:28.6042376Z @given( 2025-05-07T20:33:28.6042668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6043084Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6043501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6044015Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6044613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6044990Z ) 2025-05-07T20:33:28.6045457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6046068Z def test_silu_mul_quant( 2025-05-07T20:33:28.6046374Z self, 2025-05-07T20:33:28.6046603Z T: int, 2025-05-07T20:33:28.6046783Z D: int, 2025-05-07T20:33:28.6047013Z scale_ub: Optional[float], 2025-05-07T20:33:28.6047278Z contiguous: bool, 2025-05-07T20:33:28.6047582Z compiled: bool, 2025-05-07T20:33:28.6047873Z ) -> None: 2025-05-07T20:33:28.6048151Z torch.manual_seed(2025) 2025-05-07T20:33:28.6048462Z 2025-05-07T20:33:28.6048827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6051799Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6054335Z 2025-05-07T20:33:28.6054494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.6054773Z 2025-05-07T20:33:28.6055227Z FAILED 2025-05-07T20:33:28.6055370Z 2025-05-07T20:33:28.6055535Z =================================== FAILURES =================================== 2025-05-07T20:33:28.6056093Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:28.6056685Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:28.6057511Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:28.6058231Z | yield 2025-05-07T20:33:28.6058808Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:28.6059564Z | self._callTestMethod(testMethod) 2025-05-07T20:33:28.6060006Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:28.6060713Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:28.6061487Z | if method() is not None: 2025-05-07T20:33:28.6061819Z | ~~~~~~^^ 2025-05-07T20:33:28.6062688Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:28.6063669Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6064060Z | ^^^^^^^ 2025-05-07T20:33:28.6064817Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:28.6065646Z | raise the_error_hypothesis_found 2025-05-07T20:33:28.6066217Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:28.6066778Z +-+---------------- 1 ---------------- 2025-05-07T20:33:28.6067161Z | Traceback (most recent call last): 2025-05-07T20:33:28.6068105Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6069150Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6072086Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6074784Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6075355Z | self=, 2025-05-07T20:33:28.6075920Z | T=2048, 2025-05-07T20:33:28.6076221Z | D=5120, # or any other generated value 2025-05-07T20:33:28.6076686Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:28.6077221Z | contiguous=True, # or any other generated value 2025-05-07T20:33:28.6097355Z | compiled=False, # or any other generated value 2025-05-07T20:33:28.6098099Z | ) 2025-05-07T20:33:28.6098391Z | 2025-05-07T20:33:28.6099331Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:28.6100449Z +---------------- 2 ---------------- 2025-05-07T20:33:28.6100969Z | Traceback (most recent call last): 2025-05-07T20:33:28.6102492Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6103907Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6107335Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6111701Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6112524Z | self=, 2025-05-07T20:33:28.6113302Z | T=128, 2025-05-07T20:33:28.6113553Z | D=7168, 2025-05-07T20:33:28.6113841Z | scale_ub=None, 2025-05-07T20:33:28.6114168Z | contiguous=True, 2025-05-07T20:33:28.6114483Z | compiled=True, 2025-05-07T20:33:28.6114778Z | ) 2025-05-07T20:33:28.6115011Z | 2025-05-07T20:33:28.6115709Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6116550Z +---------------- 3 ---------------- 2025-05-07T20:33:28.6116947Z | Traceback (most recent call last): 2025-05-07T20:33:28.6117912Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:28.6118963Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6121779Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
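Each falsifying example ends with a reproduction hint. A sketch of how that hint is applied, using the first blob from this log (the decorator is temporary, and the installed Hypothesis version must match the 6.131.14 that produced the blob):

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob copied from the log
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant_repro(T: int) -> None:
        ...  # body unchanged; remove the decorator once the failure is fixed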
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.6124721Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6125346Z | self=, 2025-05-07T20:33:28.6125905Z | T=128, 2025-05-07T20:33:28.6126309Z | D=5120, 2025-05-07T20:33:28.6126633Z | scale_ub=1200.0, 2025-05-07T20:33:28.6126991Z | contiguous=True, 2025-05-07T20:33:28.6127313Z | compiled=True, 2025-05-07T20:33:28.6127622Z | ) 2025-05-07T20:33:28.6127869Z | 2025-05-07T20:33:28.6128589Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6129435Z +---------------- 4 ---------------- 2025-05-07T20:33:28.6129826Z | Traceback (most recent call last): 2025-05-07T20:33:28.6130793Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:28.6131772Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6132282Z | ~~~~~~^^ 2025-05-07T20:33:28.6133158Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:28.6134170Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6135308Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:28.6136404Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6136802Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:28.6137150Z | a, 2025-05-07T20:33:28.6137424Z | ^^ 2025-05-07T20:33:28.6137710Z | ...<23 lines>... 
2025-05-07T20:33:28.6138029Z | USE_INT64=use_int64, 2025-05-07T20:33:28.6138386Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6138715Z | ) 2025-05-07T20:33:28.6138950Z | ^ 2025-05-07T20:33:28.6139661Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:28.6140656Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6141259Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6142220Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:28.6143262Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6143901Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6146176Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:28.6147119Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6147622Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:28.6148431Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:28.6149167Z | fn() 2025-05-07T20:33:28.6149430Z | ~~^^ 2025-05-07T20:33:28.6150196Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:28.6151064Z | self.fn.run( 2025-05-07T20:33:28.6151354Z | ~~~~~~~~~~~^ 2025-05-07T20:33:28.6151637Z | *args, 2025-05-07T20:33:28.6151910Z | ^^^^^^ 2025-05-07T20:33:28.6152196Z | **current, 2025-05-07T20:33:28.6152494Z | ^^^^^^^^^^ 2025-05-07T20:33:28.6152775Z | ) 2025-05-07T20:33:28.6153008Z | ^ 2025-05-07T20:33:28.6153682Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:28.6154476Z | kernel = self.compile( 2025-05-07T20:33:28.6154892Z | src, 2025-05-07T20:33:28.6155181Z | target=target, 2025-05-07T20:33:28.6155512Z | options=options.__dict__, 2025-05-07T20:33:28.6155855Z | ) 2025-05-07T20:33:28.6156593Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:28.6157560Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6158509Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:28.6159586Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6160229Z | module_map=module_map) 2025-05-07T20:33:28.6160770Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6161213Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6161548Z | ^ 2025-05-07T20:33:28.6162186Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6162902Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:28.6163398Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:28.6164057Z | self=, 2025-05-07T20:33:28.6164760Z | T=1, # or any other generated value 2025-05-07T20:33:28.6165146Z | D=5120, # or any other generated value 2025-05-07T20:33:28.6165566Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:28.6166026Z | contiguous=True, # or any other generated value 2025-05-07T20:33:28.6166475Z | compiled=True, # or any other generated value 2025-05-07T20:33:28.6166861Z | ) 2025-05-07T20:33:28.6167084Z | 2025-05-07T20:33:28.6167792Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:28.6168622Z +------------------------------------ 2025-05-07T20:33:28.6169107Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:28.6169673Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6170223Z self=, 2025-05-07T20:33:28.6170745Z T=1, 2025-05-07T20:33:28.6170984Z D=5120, 2025-05-07T20:33:28.6171222Z scale_ub=None, 2025-05-07T20:33:28.6171497Z contiguous=True, 2025-05-07T20:33:28.6171780Z compiled=True, 2025-05-07T20:33:28.6172036Z ) 2025-05-07T20:33:28.6172461Z self = 2025-05-07T20:33:28.6173096Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6173439Z 2025-05-07T20:33:28.6173543Z @given( 2025-05-07T20:33:28.6173854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6174275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6174659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6175084Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6175510Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6175869Z ) 2025-05-07T20:33:28.6176324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6176919Z def test_silu_mul_quant( 2025-05-07T20:33:28.6177239Z self, 2025-05-07T20:33:28.6177489Z T: int, 2025-05-07T20:33:28.6177752Z D: int, 2025-05-07T20:33:28.6178042Z scale_ub: Optional[float], 2025-05-07T20:33:28.6178388Z contiguous: bool, 2025-05-07T20:33:28.6178703Z compiled: bool, 2025-05-07T20:33:28.6179000Z ) -> None: 2025-05-07T20:33:28.6179290Z torch.manual_seed(2025) 2025-05-07T20:33:28.6179623Z 2025-05-07T20:33:28.6180036Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6180474Z 2025-05-07T20:33:28.6180718Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6181089Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6181487Z x = x_sign * x_clamp 2025-05-07T20:33:28.6181781Z x0 = x[:, :D] 2025-05-07T20:33:28.6182054Z x1 = x[:, D:] 2025-05-07T20:33:28.6182315Z 2025-05-07T20:33:28.6182536Z if contiguous: 2025-05-07T20:33:28.6182821Z x0 = x0.contiguous() 2025-05-07T20:33:28.6183155Z x1 = x1.contiguous() 2025-05-07T20:33:28.6183450Z 2025-05-07T20:33:28.6183689Z if scale_ub is not None: 2025-05-07T20:33:28.6184033Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6184444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6184899Z ) 2025-05-07T20:33:28.6185136Z else: 2025-05-07T20:33:28.6185392Z scale_ub_tensor = None 2025-05-07T20:33:28.6185714Z 2025-05-07T20:33:28.6186042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6186441Z op = silu_mul_quant 2025-05-07T20:33:28.6186781Z if compiled: 2025-05-07T20:33:28.6187123Z op = torch.compile(op) 2025-05-07T20:33:28.6187517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6187894Z 2025-05-07T20:33:28.6188145Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:28.6188528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6188919Z 2025-05-07T20:33:28.6189224Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6189640Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6189998Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6190398Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6190852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6191245Z 2025-05-07T20:33:28.6191498Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6191745Z 2025-05-07T20:33:28.6191881Z moe/activation_test.py:126: 2025-05-07T20:33:28.6192255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6192732Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6193152Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6194163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6195111Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6195799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6196674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6197561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6198484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6199477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6200357Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6201188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6201897Z fn() 2025-05-07T20:33:28.6202588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6203377Z self.fn.run( 2025-05-07T20:33:28.6203976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6204838Z kernel = self.compile( 2025-05-07T20:33:28.6205613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6206479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6207004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6207332Z 2025-05-07T20:33:28.6207593Z self = 2025-05-07T20:33:28.6209294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6211200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f8c4a502700>} 2025-05-07T20:33:28.6213168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6214523Z context = 2025-05-07T20:33:28.6214915Z 2025-05-07T20:33:28.6215146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6215845Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6216478Z module_map=module_map) 2025-05-07T20:33:28.6216966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6217447Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6217785Z E ^ 2025-05-07T20:33:28.6218372Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6218958Z 2025-05-07T20:33:28.6219541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6220202Z 2025-05-07T20:33:28.6220328Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6220852Z self=, 2025-05-07T20:33:28.6221447Z T=2048, 2025-05-07T20:33:28.6221689Z D=5120, 2025-05-07T20:33:28.6221923Z scale_ub=1200.0, 2025-05-07T20:33:28.6222208Z contiguous=True, 2025-05-07T20:33:28.6222489Z compiled=False, 2025-05-07T20:33:28.6222739Z ) 2025-05-07T20:33:28.6223140Z self = 2025-05-07T20:33:28.6223780Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6224131Z 2025-05-07T20:33:28.6224225Z @given( 2025-05-07T20:33:28.6224511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6224918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6225296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6225716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6226139Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6226507Z ) 2025-05-07T20:33:28.6226956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6227552Z def test_silu_mul_quant( 2025-05-07T20:33:28.6227863Z self, 2025-05-07T20:33:28.6228111Z T: int, 2025-05-07T20:33:28.6228373Z D: int, 2025-05-07T20:33:28.6228655Z scale_ub: Optional[float], 2025-05-07T20:33:28.6228993Z contiguous: bool, 2025-05-07T20:33:28.6229331Z compiled: bool, 2025-05-07T20:33:28.6229648Z ) -> None: 2025-05-07T20:33:28.6229922Z torch.manual_seed(2025) 2025-05-07T20:33:28.6230242Z 2025-05-07T20:33:28.6230590Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6231026Z 2025-05-07T20:33:28.6231362Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6231741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6232144Z x = x_sign * x_clamp 2025-05-07T20:33:28.6232465Z x0 = x[:, :D] 2025-05-07T20:33:28.6232751Z x1 = x[:, D:] 2025-05-07T20:33:28.6233035Z 2025-05-07T20:33:28.6233278Z if contiguous: 2025-05-07T20:33:28.6233592Z x0 = x0.contiguous() 2025-05-07T20:33:28.6233945Z x1 = x1.contiguous() 2025-05-07T20:33:28.6234263Z 2025-05-07T20:33:28.6234522Z if scale_ub is not None: 2025-05-07T20:33:28.6234892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6235324Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6235713Z ) 2025-05-07T20:33:28.6235970Z else: 2025-05-07T20:33:28.6236224Z scale_ub_tensor = None 2025-05-07T20:33:28.6236617Z 2025-05-07T20:33:28.6236921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6237337Z op = silu_mul_quant 2025-05-07T20:33:28.6237655Z if compiled: 
2025-05-07T20:33:28.6238047Z op = torch.compile(op) 2025-05-07T20:33:28.6238431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6238776Z 2025-05-07T20:33:28.6239030Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6239248Z 2025-05-07T20:33:28.6239390Z moe/activation_test.py:117: 2025-05-07T20:33:28.6239777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6240228Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6240602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6241525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6242431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6243153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6244081Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6245077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6245861Z kernel = self.compile( 2025-05-07T20:33:28.6246593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6247472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6247998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6248312Z 2025-05-07T20:33:28.6248586Z self = 2025-05-07T20:33:28.6250081Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6251966Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c4a5b2020>} 2025-05-07T20:33:28.6253754Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6255115Z context = 2025-05-07T20:33:28.6255500Z 2025-05-07T20:33:28.6255729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6256469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6257139Z module_map=module_map) 2025-05-07T20:33:28.6257659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6258104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6258449Z E ^ 2025-05-07T20:33:28.6259087Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6259697Z 2025-05-07T20:33:28.6260266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6260951Z 2025-05-07T20:33:28.6261092Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6261626Z self=, 2025-05-07T20:33:28.6262159Z T=2048, 2025-05-07T20:33:28.6262403Z D=5120, 2025-05-07T20:33:28.6262645Z scale_ub=1200.0, 2025-05-07T20:33:28.6262940Z contiguous=True, 2025-05-07T20:33:28.6263223Z compiled=True, 2025-05-07T20:33:28.6263546Z ) 2025-05-07T20:33:28.6263956Z self = 2025-05-07T20:33:28.6264634Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.6265054Z 2025-05-07T20:33:28.6265167Z @given( 2025-05-07T20:33:28.6265464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6265887Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6266301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6266733Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6267125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6267399Z ) 2025-05-07T20:33:28.6267733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6268166Z def test_silu_mul_quant( 2025-05-07T20:33:28.6268400Z self, 2025-05-07T20:33:28.6268575Z T: int, 2025-05-07T20:33:28.6268771Z D: int, 2025-05-07T20:33:28.6268993Z scale_ub: Optional[float], 2025-05-07T20:33:28.6269256Z contiguous: bool, 2025-05-07T20:33:28.6269484Z compiled: bool, 2025-05-07T20:33:28.6269707Z ) -> None: 2025-05-07T20:33:28.6269935Z torch.manual_seed(2025) 2025-05-07T20:33:28.6270162Z 2025-05-07T20:33:28.6270426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6270818Z 2025-05-07T20:33:28.6270993Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6271276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6271577Z x = x_sign * x_clamp 2025-05-07T20:33:28.6271798Z x0 = x[:, :D] 2025-05-07T20:33:28.6272009Z x1 = x[:, D:] 2025-05-07T20:33:28.6272207Z 2025-05-07T20:33:28.6272377Z if contiguous: 2025-05-07T20:33:28.6272599Z x0 = x0.contiguous() 2025-05-07T20:33:28.6272850Z x1 = x1.contiguous() 2025-05-07T20:33:28.6273071Z 2025-05-07T20:33:28.6273264Z if scale_ub is not None: 2025-05-07T20:33:28.6273531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6273851Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6274152Z ) 2025-05-07T20:33:28.6274333Z else: 2025-05-07T20:33:28.6274534Z scale_ub_tensor = None 2025-05-07T20:33:28.6274764Z 2025-05-07T20:33:28.6274992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6275298Z op = silu_mul_quant 2025-05-07T20:33:28.6275534Z if compiled: 2025-05-07T20:33:28.6275772Z op = torch.compile(op) 2025-05-07T20:33:28.6276056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6276311Z 2025-05-07T20:33:28.6276495Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6276772Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6277046Z 2025-05-07T20:33:28.6277277Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6277610Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6277945Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6278252Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6278605Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6278912Z 2025-05-07T20:33:28.6279096Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.6279296Z 2025-05-07T20:33:28.6279386Z moe/activation_test.py:126: 2025-05-07T20:33:28.6279679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6280000Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6280321Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6281100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6281845Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6282431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6283150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6283835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6284693Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6285420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6286050Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6286643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6287142Z fn() 2025-05-07T20:33:28.6287640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6288216Z self.fn.run( 2025-05-07T20:33:28.6288672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6289192Z kernel = self.compile( 2025-05-07T20:33:28.6289726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6290425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6290814Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6291045Z 2025-05-07T20:33:28.6291247Z self = 2025-05-07T20:33:28.6292327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6293706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c495904a0>} 2025-05-07T20:33:28.6295045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6296059Z context = 2025-05-07T20:33:28.6296352Z 2025-05-07T20:33:28.6296512Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6297023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6297475Z module_map=module_map) 2025-05-07T20:33:28.6297881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6298317Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6298582Z E ^ 2025-05-07T20:33:28.6299100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6299563Z 2025-05-07T20:33:28.6299976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6300487Z 2025-05-07T20:33:28.6300598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6301006Z self=, 2025-05-07T20:33:28.6310414Z T=16384, 2025-05-07T20:33:28.6310659Z D=7168, 2025-05-07T20:33:28.6310858Z scale_ub=1200.0, 2025-05-07T20:33:28.6311088Z contiguous=False, 2025-05-07T20:33:28.6311310Z compiled=False, 2025-05-07T20:33:28.6311517Z ) 2025-05-07T20:33:28.6311843Z self = 2025-05-07T20:33:28.6312480Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6312769Z 2025-05-07T20:33:28.6312850Z @given( 2025-05-07T20:33:28.6313078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6313501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6313800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6314126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6314453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6314730Z ) 2025-05-07T20:33:28.6315071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6315510Z def test_silu_mul_quant( 2025-05-07T20:33:28.6315743Z self, 2025-05-07T20:33:28.6315939Z T: int, 2025-05-07T20:33:28.6316137Z D: int, 2025-05-07T20:33:28.6316356Z scale_ub: Optional[float], 2025-05-07T20:33:28.6316622Z contiguous: bool, 2025-05-07T20:33:28.6316866Z compiled: bool, 2025-05-07T20:33:28.6317091Z ) -> None: 2025-05-07T20:33:28.6317295Z torch.manual_seed(2025) 2025-05-07T20:33:28.6317537Z 2025-05-07T20:33:28.6317811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6318145Z 2025-05-07T20:33:28.6318342Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6318633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6319015Z x = x_sign * x_clamp 2025-05-07T20:33:28.6319253Z x0 = x[:, :D] 2025-05-07T20:33:28.6319468Z x1 = x[:, D:] 2025-05-07T20:33:28.6319668Z 2025-05-07T20:33:28.6319856Z if contiguous: 2025-05-07T20:33:28.6320088Z x0 = x0.contiguous() 2025-05-07T20:33:28.6320341Z x1 = x1.contiguous() 2025-05-07T20:33:28.6320581Z 2025-05-07T20:33:28.6320773Z if scale_ub is not None: 2025-05-07T20:33:28.6321052Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6321415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6321719Z ) 2025-05-07T20:33:28.6321907Z else: 2025-05-07T20:33:28.6322104Z scale_ub_tensor = None 2025-05-07T20:33:28.6322348Z 2025-05-07T20:33:28.6322574Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6322872Z op = silu_mul_quant 2025-05-07T20:33:28.6323117Z if compiled: 2025-05-07T20:33:28.6323367Z op = torch.compile(op) 2025-05-07T20:33:28.6323648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6323920Z 2025-05-07T20:33:28.6324105Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6324403Z 2025-05-07T20:33:28.6324507Z moe/activation_test.py:117: 2025-05-07T20:33:28.6324798Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6325125Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6325400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6326188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
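Every example tried in this run dies at the same point: Triton cannot lower the fp8e4nv (FP8 e4m3) dtype on this GPU and offers only fp8e4b15 and fp8e5. fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper class), while the A10G GPUs on AWS g5 instances report 8.6, which is consistent with the error. A minimal sketch of a guard that would skip these cases on unsupported hardware instead of failing (the helper name and the 8.9 threshold are assumptions, not something the test as shown does):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ on this backend")
    def test_silu_mul_quant_guarded():
        ...  # would wrap the existing test body shown in this log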
2025-05-07T20:33:28.6326871Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6327409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6328073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6328727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6329246Z kernel = self.compile( 2025-05-07T20:33:28.6329777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6330417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6330800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6331080Z 2025-05-07T20:33:28.6331280Z self = 2025-05-07T20:33:28.6332392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6333753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c49293880>} 2025-05-07T20:33:28.6335078Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6336079Z context = 2025-05-07T20:33:28.6336367Z 2025-05-07T20:33:28.6336527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6337046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6337504Z module_map=module_map) 2025-05-07T20:33:28.6337855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6338196Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6338491Z E ^ 2025-05-07T20:33:28.6338941Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6339423Z 2025-05-07T20:33:28.6339856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6340366Z 2025-05-07T20:33:28.6340463Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6340866Z self=, 2025-05-07T20:33:28.6341247Z T=1, 2025-05-07T20:33:28.6341423Z D=7168, 2025-05-07T20:33:28.6341609Z scale_ub=None, 2025-05-07T20:33:28.6341808Z contiguous=True, 2025-05-07T20:33:28.6342025Z compiled=True, 2025-05-07T20:33:28.6342220Z ) 2025-05-07T20:33:28.6342531Z self = 2025-05-07T20:33:28.6343001Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6343252Z 2025-05-07T20:33:28.6343326Z @given( 2025-05-07T20:33:28.6343540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6343842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6344136Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6344453Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6344763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6345035Z ) 2025-05-07T20:33:28.6345370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6345792Z def test_silu_mul_quant( 2025-05-07T20:33:28.6346020Z self, 2025-05-07T20:33:28.6346253Z T: int, 2025-05-07T20:33:28.6346433Z D: int, 2025-05-07T20:33:28.6346644Z scale_ub: Optional[float], 2025-05-07T20:33:28.6346905Z contiguous: bool, 2025-05-07T20:33:28.6347128Z compiled: bool, 2025-05-07T20:33:28.6347338Z ) -> None: 2025-05-07T20:33:28.6347544Z torch.manual_seed(2025) 2025-05-07T20:33:28.6347768Z 2025-05-07T20:33:28.6348033Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6348363Z 2025-05-07T20:33:28.6348541Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6348819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6349115Z x = x_sign * x_clamp 2025-05-07T20:33:28.6349342Z x0 = x[:, :D] 2025-05-07T20:33:28.6349538Z x1 = x[:, D:] 2025-05-07T20:33:28.6349734Z 2025-05-07T20:33:28.6349910Z if contiguous: 2025-05-07T20:33:28.6350171Z x0 = x0.contiguous() 2025-05-07T20:33:28.6350420Z x1 = x1.contiguous() 2025-05-07T20:33:28.6350652Z 2025-05-07T20:33:28.6350825Z if scale_ub is not None: 2025-05-07T20:33:28.6351132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6351462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6351750Z ) 2025-05-07T20:33:28.6351935Z else: 2025-05-07T20:33:28.6352138Z scale_ub_tensor = None 2025-05-07T20:33:28.6352370Z 2025-05-07T20:33:28.6352593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6352895Z op = silu_mul_quant 2025-05-07T20:33:28.6353129Z if compiled: 2025-05-07T20:33:28.6353363Z op = torch.compile(op) 2025-05-07T20:33:28.6353649Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6353908Z 2025-05-07T20:33:28.6354081Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6354364Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6354641Z 2025-05-07T20:33:28.6354862Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6355191Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6355469Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6355762Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6356161Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6356457Z 2025-05-07T20:33:28.6356639Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6356833Z 2025-05-07T20:33:28.6356923Z moe/activation_test.py:126: 2025-05-07T20:33:28.6357209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6357537Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6357845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6358614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6359355Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6359899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6360600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6361280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6361989Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6362697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6363321Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6363910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6364514Z fn() 2025-05-07T20:33:28.6365055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6365628Z self.fn.run( 2025-05-07T20:33:28.6366085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6366598Z kernel = self.compile( 2025-05-07T20:33:28.6367126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6367765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6368153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6368375Z 2025-05-07T20:33:28.6368575Z self = 2025-05-07T20:33:28.6369646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6371143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c49450860>} 2025-05-07T20:33:28.6372474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6373486Z context = 2025-05-07T20:33:28.6373768Z 2025-05-07T20:33:28.6373929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6374443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6374900Z module_map=module_map) 2025-05-07T20:33:28.6375250Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6375593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6375847Z E ^ 2025-05-07T20:33:28.6376291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6376786Z 2025-05-07T20:33:28.6377195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6377705Z 2025-05-07T20:33:28.6377802Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6378201Z self=, 2025-05-07T20:33:28.6378583Z T=4096, 2025-05-07T20:33:28.6378757Z D=5120, 2025-05-07T20:33:28.6378938Z scale_ub=None, 2025-05-07T20:33:28.6379144Z contiguous=False, 2025-05-07T20:33:28.6379400Z compiled=False, 2025-05-07T20:33:28.6379590Z ) 2025-05-07T20:33:28.6379898Z self = 2025-05-07T20:33:28.6380380Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6380650Z 2025-05-07T20:33:28.6380718Z @given( 2025-05-07T20:33:28.6380937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6381234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6381535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6381852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6382169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6382437Z ) 2025-05-07T20:33:28.6382773Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6383195Z def test_silu_mul_quant( 2025-05-07T20:33:28.6383427Z self, 2025-05-07T20:33:28.6383609Z T: int, 2025-05-07T20:33:28.6383789Z D: int, 2025-05-07T20:33:28.6383995Z scale_ub: Optional[float], 2025-05-07T20:33:28.6384309Z contiguous: bool, 2025-05-07T20:33:28.6384532Z compiled: bool, 2025-05-07T20:33:28.6384746Z ) -> None: 2025-05-07T20:33:28.6384949Z torch.manual_seed(2025) 2025-05-07T20:33:28.6385174Z 2025-05-07T20:33:28.6385435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6385769Z 2025-05-07T20:33:28.6385943Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6386222Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6386525Z x = x_sign * x_clamp 2025-05-07T20:33:28.6386755Z x0 = x[:, :D] 2025-05-07T20:33:28.6386962Z x1 = x[:, D:] 2025-05-07T20:33:28.6387164Z 2025-05-07T20:33:28.6387345Z if contiguous: 2025-05-07T20:33:28.6387564Z x0 = x0.contiguous() 2025-05-07T20:33:28.6387814Z x1 = x1.contiguous() 2025-05-07T20:33:28.6388098Z 2025-05-07T20:33:28.6388277Z if scale_ub is not None: 2025-05-07T20:33:28.6388549Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6388877Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6389217Z ) 2025-05-07T20:33:28.6389400Z else: 2025-05-07T20:33:28.6389598Z scale_ub_tensor = None 2025-05-07T20:33:28.6389856Z 2025-05-07T20:33:28.6390097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6390397Z op = silu_mul_quant 2025-05-07T20:33:28.6390633Z if compiled: 2025-05-07T20:33:28.6390865Z op = torch.compile(op) 2025-05-07T20:33:28.6391147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6391407Z 2025-05-07T20:33:28.6391578Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6391741Z 2025-05-07T20:33:28.6391832Z moe/activation_test.py:117: 2025-05-07T20:33:28.6392114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6392211Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6392307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6392808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6392896Z 
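In the ref_fn traces above, the same error surfaces one layer deeper: triton_quantize_fp8_row launches an autotuned kernel, so triton/runtime/autotuner.py first benchmarks every candidate config (run -> _bench -> do_bench), and the first compilation inside that sweep is where the unsupported-dtype ValueError fires. A rough sketch of the decorator pattern involved, with illustrative config values only (not FBGEMM's actual tuning space):

    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=4, num_stages=2),
            triton.Config({"BLOCK_SIZE": 4096}, num_warps=8, num_stages=3),
        ],
        key=["N"],  # re-tune whenever the row length changes
    )
    @triton.jit
    def _rowwise_copy_kernel(x_ptr, y_ptr, N, BLOCK_SIZE: tl.constexpr):
        # On the first launch for a given N, the autotuner compiles and times
        # each config, which is why compile errors appear under autotuner.py
        # frames in the traceback rather than at the call site.
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offs < N
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)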
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6393321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6393545Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6393878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6393972Z kernel = self.compile( 2025-05-07T20:33:28.6394347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6394519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6394647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6394652Z 2025-05-07T20:33:28.6394852Z self = 2025-05-07T20:33:28.6395630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6396129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48734ea0>} 2025-05-07T20:33:28.6396870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6397057Z context = 2025-05-07T20:33:28.6397062Z 2025-05-07T20:33:28.6397263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6397531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6397633Z module_map=module_map) 2025-05-07T20:33:28.6397789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6397885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6397956Z E ^ 2025-05-07T20:33:28.6398313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6398318Z 2025-05-07T20:33:28.6398726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6398730Z 2025-05-07T20:33:28.6398828Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6399098Z self=, 2025-05-07T20:33:28.6399165Z T=4096, 2025-05-07T20:33:28.6399233Z D=7168, 2025-05-07T20:33:28.6399352Z scale_ub=None, 2025-05-07T20:33:28.6399429Z contiguous=False, 2025-05-07T20:33:28.6399508Z compiled=False, 2025-05-07T20:33:28.6399574Z ) 2025-05-07T20:33:28.6399788Z self = 2025-05-07T20:33:28.6399961Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6399965Z 2025-05-07T20:33:28.6400033Z @given( 2025-05-07T20:33:28.6400144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6400240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6400347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6400455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6400570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6400634Z ) 2025-05-07T20:33:28.6400881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6400969Z def test_silu_mul_quant( 2025-05-07T20:33:28.6401036Z self, 2025-05-07T20:33:28.6401110Z T: int, 2025-05-07T20:33:28.6401177Z D: int, 2025-05-07T20:33:28.6401268Z scale_ub: Optional[float], 2025-05-07T20:33:28.6401401Z contiguous: bool, 2025-05-07T20:33:28.6401480Z compiled: bool, 2025-05-07T20:33:28.6401550Z ) -> None: 2025-05-07T20:33:28.6401642Z torch.manual_seed(2025) 2025-05-07T20:33:28.6401705Z 2025-05-07T20:33:28.6401867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6401935Z 2025-05-07T20:33:28.6402019Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6402144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6402223Z x = x_sign * x_clamp 2025-05-07T20:33:28.6402297Z x0 = x[:, :D] 2025-05-07T20:33:28.6402372Z x1 = x[:, D:] 2025-05-07T20:33:28.6402437Z 2025-05-07T20:33:28.6402513Z if contiguous: 2025-05-07T20:33:28.6402604Z x0 = x0.contiguous() 2025-05-07T20:33:28.6402683Z x1 = x1.contiguous() 2025-05-07T20:33:28.6402744Z 2025-05-07T20:33:28.6402828Z if scale_ub is not None: 2025-05-07T20:33:28.6402930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6403059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6403130Z ) 2025-05-07T20:33:28.6403196Z else: 2025-05-07T20:33:28.6403288Z scale_ub_tensor = None 2025-05-07T20:33:28.6403352Z 2025-05-07T20:33:28.6403473Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6403561Z op = silu_mul_quant 2025-05-07T20:33:28.6403638Z if compiled: 2025-05-07T20:33:28.6403727Z op = torch.compile(op) 2025-05-07T20:33:28.6403832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6403894Z 2025-05-07T20:33:28.6404027Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6404032Z 2025-05-07T20:33:28.6404131Z moe/activation_test.py:117: 2025-05-07T20:33:28.6404350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6404470Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6404566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6405061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6405153Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6405507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6405726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6406136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6406231Z kernel = self.compile( 2025-05-07T20:33:28.6406678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6406848Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6406973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6406977Z 2025-05-07T20:33:28.6407180Z self = 2025-05-07T20:33:28.6407960Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6408701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48735260>} 2025-05-07T20:33:28.6409452Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6409637Z context = 2025-05-07T20:33:28.6409738Z 2025-05-07T20:33:28.6409896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6410152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6410259Z module_map=module_map) 2025-05-07T20:33:28.6410417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6410508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6410584Z E ^ 2025-05-07T20:33:28.6410930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6410939Z 2025-05-07T20:33:28.6411395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6411402Z 2025-05-07T20:33:28.6411545Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6411797Z self=, 2025-05-07T20:33:28.6411914Z T=128, 2025-05-07T20:33:28.6412005Z D=7168, 2025-05-07T20:33:28.6412079Z scale_ub=None, 2025-05-07T20:33:28.6412159Z contiguous=False, 2025-05-07T20:33:28.6412234Z compiled=True, 2025-05-07T20:33:28.6412299Z ) 2025-05-07T20:33:28.6412517Z self = 2025-05-07T20:33:28.6412679Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6412684Z 2025-05-07T20:33:28.6412758Z @given( 2025-05-07T20:33:28.6412875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6413138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6413252Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6413364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6413469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6413542Z ) 2025-05-07T20:33:28.6413780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6413866Z def test_silu_mul_quant( 2025-05-07T20:33:28.6413937Z self, 2025-05-07T20:33:28.6414002Z T: int, 2025-05-07T20:33:28.6414075Z D: int, 2025-05-07T20:33:28.6414170Z scale_ub: Optional[float], 2025-05-07T20:33:28.6414251Z contiguous: bool, 2025-05-07T20:33:28.6414337Z compiled: bool, 2025-05-07T20:33:28.6414407Z ) -> None: 2025-05-07T20:33:28.6414493Z torch.manual_seed(2025) 2025-05-07T20:33:28.6414630Z 2025-05-07T20:33:28.6414794Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6414863Z 2025-05-07T20:33:28.6414954Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6415133Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6415215Z x = x_sign * x_clamp 2025-05-07T20:33:28.6415294Z x0 = x[:, :D] 2025-05-07T20:33:28.6415367Z x1 = x[:, D:] 2025-05-07T20:33:28.6415439Z 2025-05-07T20:33:28.6415514Z if contiguous: 2025-05-07T20:33:28.6415597Z x0 = x0.contiguous() 2025-05-07T20:33:28.6415683Z x1 = x1.contiguous() 2025-05-07T20:33:28.6415747Z 2025-05-07T20:33:28.6415827Z if scale_ub is not None: 2025-05-07T20:33:28.6415930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6416059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6416127Z ) 2025-05-07T20:33:28.6416197Z else: 2025-05-07T20:33:28.6416286Z scale_ub_tensor = None 2025-05-07T20:33:28.6416348Z 2025-05-07T20:33:28.6416479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6416559Z op = silu_mul_quant 2025-05-07T20:33:28.6416640Z if compiled: 2025-05-07T20:33:28.6416738Z op = torch.compile(op) 2025-05-07T20:33:28.6416837Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6416954Z 2025-05-07T20:33:28.6417036Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6417149Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6417216Z 2025-05-07T20:33:28.6417343Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6417438Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6417538Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6417653Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6417786Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6417857Z 2025-05-07T20:33:28.6417947Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6417954Z 2025-05-07T20:33:28.6418048Z moe/activation_test.py:126: 2025-05-07T20:33:28.6418171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6418268Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6418405Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6418956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6419051Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6419406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6419621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6419985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6420279Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6420652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6420816Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6421151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6421225Z fn() 2025-05-07T20:33:28.6421618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6421693Z self.fn.run( 2025-05-07T20:33:28.6422027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6422112Z kernel = self.compile( 2025-05-07T20:33:28.6422531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6422769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6422890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6422895Z 2025-05-07T20:33:28.6423099Z self = 2025-05-07T20:33:28.6423871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6424370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c487377e0>} 2025-05-07T20:33:28.6425112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6425302Z context = 2025-05-07T20:33:28.6425307Z 2025-05-07T20:33:28.6425472Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6425768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6425868Z module_map=module_map) 2025-05-07T20:33:28.6426027Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6426120Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6426191Z E ^ 2025-05-07T20:33:28.6426536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6426540Z 2025-05-07T20:33:28.6426957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6426964Z 2025-05-07T20:33:28.6427065Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6427282Z self=, 2025-05-07T20:33:28.6427356Z T=128, 2025-05-07T20:33:28.6427424Z D=7168, 2025-05-07T20:33:28.6427500Z scale_ub=None, 2025-05-07T20:33:28.6427584Z contiguous=False, 2025-05-07T20:33:28.6427659Z compiled=False, 2025-05-07T20:33:28.6427722Z ) 2025-05-07T20:33:28.6427940Z self = 2025-05-07T20:33:28.6428104Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6428109Z 2025-05-07T20:33:28.6428175Z @given( 2025-05-07T20:33:28.6428294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6428385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6428503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6428655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6428762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6428835Z ) 2025-05-07T20:33:28.6429073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6429159Z def test_silu_mul_quant( 2025-05-07T20:33:28.6429234Z self, 2025-05-07T20:33:28.6429303Z T: int, 2025-05-07T20:33:28.6429371Z D: int, 2025-05-07T20:33:28.6429464Z scale_ub: Optional[float], 2025-05-07T20:33:28.6429545Z contiguous: bool, 2025-05-07T20:33:28.6429621Z compiled: bool, 2025-05-07T20:33:28.6429695Z ) -> None: 2025-05-07T20:33:28.6429779Z torch.manual_seed(2025) 2025-05-07T20:33:28.6429849Z 2025-05-07T20:33:28.6430011Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6430075Z 2025-05-07T20:33:28.6430206Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6430327Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6430409Z x = x_sign * x_clamp 2025-05-07T20:33:28.6430523Z x0 = x[:, :D] 2025-05-07T20:33:28.6430596Z x1 = x[:, D:] 2025-05-07T20:33:28.6430657Z 2025-05-07T20:33:28.6430739Z if contiguous: 2025-05-07T20:33:28.6430825Z x0 = x0.contiguous() 2025-05-07T20:33:28.6430907Z x1 = x1.contiguous() 2025-05-07T20:33:28.6430976Z 2025-05-07T20:33:28.6431058Z if scale_ub is not None: 2025-05-07T20:33:28.6431159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6431293Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6431360Z ) 2025-05-07T20:33:28.6431433Z else: 2025-05-07T20:33:28.6431517Z scale_ub_tensor = None 2025-05-07T20:33:28.6431580Z 2025-05-07T20:33:28.6431706Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6431791Z op = silu_mul_quant 2025-05-07T20:33:28.6431866Z if compiled: 2025-05-07T20:33:28.6431966Z op = torch.compile(op) 2025-05-07T20:33:28.6432065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6432128Z 2025-05-07T20:33:28.6432219Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6432223Z 2025-05-07T20:33:28.6432357Z moe/activation_test.py:117: 2025-05-07T20:33:28.6432486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6432580Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6432671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6433168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6433254Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6433604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6433833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6434167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6434256Z kernel = self.compile( 2025-05-07T20:33:28.6434628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6434800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6434927Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6434931Z 2025-05-07T20:33:28.6435128Z self = 2025-05-07T20:33:28.6435902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6436442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48551440>} 2025-05-07T20:33:28.6437183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6437374Z context = 2025-05-07T20:33:28.6437379Z 2025-05-07T20:33:28.6437534Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6437797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6437898Z module_map=module_map) 2025-05-07T20:33:28.6438053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6438193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6438260Z E ^ 2025-05-07T20:33:28.6438647Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6438659Z 2025-05-07T20:33:28.6439066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6439073Z 2025-05-07T20:33:28.6439167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6439390Z self=, 2025-05-07T20:33:28.6439459Z T=4096, 2025-05-07T20:33:28.6439527Z D=5120, 2025-05-07T20:33:28.6439607Z scale_ub=1200.0, 2025-05-07T20:33:28.6439681Z contiguous=True, 2025-05-07T20:33:28.6439756Z compiled=False, 2025-05-07T20:33:28.6439832Z ) 2025-05-07T20:33:28.6440048Z self = 2025-05-07T20:33:28.6440223Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.6440231Z 2025-05-07T20:33:28.6450542Z @given( 2025-05-07T20:33:28.6450678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6450784Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6450895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6451092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6451212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6451286Z ) 2025-05-07T20:33:28.6451531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6451637Z def test_silu_mul_quant( 2025-05-07T20:33:28.6451714Z self, 2025-05-07T20:33:28.6451792Z T: int, 2025-05-07T20:33:28.6451871Z D: int, 2025-05-07T20:33:28.6451966Z scale_ub: Optional[float], 2025-05-07T20:33:28.6452052Z contiguous: bool, 2025-05-07T20:33:28.6452151Z compiled: bool, 2025-05-07T20:33:28.6452233Z ) -> None: 2025-05-07T20:33:28.6452335Z torch.manual_seed(2025) 2025-05-07T20:33:28.6452403Z 2025-05-07T20:33:28.6452577Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6452656Z 2025-05-07T20:33:28.6452747Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6452878Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6452976Z x = x_sign * x_clamp 2025-05-07T20:33:28.6453053Z x0 = x[:, :D] 2025-05-07T20:33:28.6453129Z x1 = x[:, D:] 2025-05-07T20:33:28.6453208Z 2025-05-07T20:33:28.6453291Z if contiguous: 2025-05-07T20:33:28.6453378Z x0 = x0.contiguous() 2025-05-07T20:33:28.6453474Z x1 = x1.contiguous() 2025-05-07T20:33:28.6453542Z 2025-05-07T20:33:28.6453633Z if scale_ub is not None: 2025-05-07T20:33:28.6453737Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6453874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6454003Z ) 2025-05-07T20:33:28.6454081Z else: 2025-05-07T20:33:28.6454179Z scale_ub_tensor = None 2025-05-07T20:33:28.6454254Z 2025-05-07T20:33:28.6454382Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6454470Z op = silu_mul_quant 2025-05-07T20:33:28.6454563Z if compiled: 2025-05-07T20:33:28.6454657Z op = torch.compile(op) 2025-05-07T20:33:28.6454758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6454836Z 2025-05-07T20:33:28.6454919Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6454924Z 2025-05-07T20:33:28.6455026Z moe/activation_test.py:117: 2025-05-07T20:33:28.6455156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6455255Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6455353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6455906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6456041Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6456404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6456626Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6456965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6457054Z kernel = self.compile( 2025-05-07T20:33:28.6457431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6457605Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6457727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6457734Z 2025-05-07T20:33:28.6457945Z self = 2025-05-07T20:33:28.6458724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6459267Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c485520c0>} 2025-05-07T20:33:28.6460009Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6460195Z context = 2025-05-07T20:33:28.6460200Z 2025-05-07T20:33:28.6460365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6460625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6460737Z module_map=module_map) 2025-05-07T20:33:28.6460924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6461042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6461124Z E ^ 2025-05-07T20:33:28.6461483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6461488Z 2025-05-07T20:33:28.6461897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6461902Z 2025-05-07T20:33:28.6462000Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6462215Z self=, 2025-05-07T20:33:28.6462289Z T=1, 2025-05-07T20:33:28.6462371Z D=5120, 2025-05-07T20:33:28.6462452Z scale_ub=None, 2025-05-07T20:33:28.6462574Z contiguous=True, 2025-05-07T20:33:28.6462658Z compiled=True, 2025-05-07T20:33:28.6462731Z ) 2025-05-07T20:33:28.6462952Z self = 2025-05-07T20:33:28.6463108Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6463115Z 2025-05-07T20:33:28.6463189Z @given( 2025-05-07T20:33:28.6463309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6463405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6463515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6463635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6463742Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6463814Z ) 2025-05-07T20:33:28.6464055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6464217Z def test_silu_mul_quant( 2025-05-07T20:33:28.6464299Z self, 2025-05-07T20:33:28.6464373Z T: int, 2025-05-07T20:33:28.6464444Z D: int, 2025-05-07T20:33:28.6464583Z scale_ub: Optional[float], 2025-05-07T20:33:28.6464674Z contiguous: bool, 2025-05-07T20:33:28.6464759Z compiled: bool, 2025-05-07T20:33:28.6464848Z ) -> None: 2025-05-07T20:33:28.6464941Z torch.manual_seed(2025) 2025-05-07T20:33:28.6465012Z 2025-05-07T20:33:28.6465188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6465257Z 2025-05-07T20:33:28.6465347Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6465475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6465562Z x = x_sign * x_clamp 2025-05-07T20:33:28.6465643Z x0 = x[:, :D] 2025-05-07T20:33:28.6465720Z x1 = x[:, D:] 2025-05-07T20:33:28.6465787Z 2025-05-07T20:33:28.6465876Z if contiguous: 2025-05-07T20:33:28.6465966Z x0 = x0.contiguous() 2025-05-07T20:33:28.6466055Z x1 = x1.contiguous() 2025-05-07T20:33:28.6466128Z 2025-05-07T20:33:28.6466216Z if scale_ub is not None: 2025-05-07T20:33:28.6466319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6466462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6466582Z ) 2025-05-07T20:33:28.6466653Z else: 2025-05-07T20:33:28.6466752Z scale_ub_tensor = None 2025-05-07T20:33:28.6466820Z 2025-05-07T20:33:28.6466952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6467038Z op = silu_mul_quant 2025-05-07T20:33:28.6467118Z if compiled: 2025-05-07T20:33:28.6467221Z op = torch.compile(op) 2025-05-07T20:33:28.6467324Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6467393Z 2025-05-07T20:33:28.6467491Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.6467610Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.6467681Z 2025-05-07T20:33:28.6467817Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6467916Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.6468008Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.6468132Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.6468270Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6468341Z 2025-05-07T20:33:28.6468446Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:28.6468451Z 2025-05-07T20:33:28.6468544Z moe/activation_test.py:126: 2025-05-07T20:33:28.6468668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6468773Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.6468904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.6469565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.6469663Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.6470018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6470245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6470611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.6470875Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.6471244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.6471406Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.6471783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.6471864Z fn() 2025-05-07T20:33:28.6472300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.6472383Z self.fn.run( 2025-05-07T20:33:28.6472714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6472810Z kernel = self.compile( 2025-05-07T20:33:28.6473183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6473359Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6473484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6473489Z 2025-05-07T20:33:28.6473688Z self = 2025-05-07T20:33:28.6474477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6474976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48552d40>} 2025-05-07T20:33:28.6475757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6475951Z context = 2025-05-07T20:33:28.6475955Z 2025-05-07T20:33:28.6476118Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6476382Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6476495Z module_map=module_map) 2025-05-07T20:33:28.6476657Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6476764Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.6476839Z E ^ 2025-05-07T20:33:28.6477194Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
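Every example so far fails for the same underlying reason: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which NVIDIA GPUs only support natively from sm_89 (Ada/Hopper) onward, while this linux.g5.4xlarge runner carries an A10G that reports sm_86. A minimal sketch of the capability gate (not part of the test suite, written here only to make the failure mode concrete):

import torch

# fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
# the A10G on a g5.4xlarge reports (8, 6), hence the ValueError above.
major, minor = torch.cuda.get_device_capability()
supports_fp8e4nv = (major, minor) >= (8, 9)
print(f"sm_{major}{minor}: fp8e4nv supported = {supports_fp8e4nv}")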
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
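For reference, the quantity the test checks is y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1, quantized row-wise to FP8 with one dequantization scale per row. A hedged eager-mode sketch of that round trip (rowwise_quant_ref is an illustrative stand-in, not FBGEMM's triton_quantize_fp8_row, and how the real kernel applies scale_ub is an assumption here):

from typing import Optional, Tuple

import torch

def rowwise_quant_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Scale each row so its max magnitude maps to the FP8 max, then cast;
    # the returned scale is the per-row dequantization factor.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumption: ub caps the row max
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

x0 = torch.randn(4, 8)
x1 = torch.randn(4, 8)
y = x0 * torch.sigmoid(x0) * x1                    # SiLU(x0) * x1
y_fp8, y_scale = rowwise_quant_ref(y)
y_dq = y_fp8.to(torch.float32) * y_scale[:, None]  # the test's dequant check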
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  fails identically at moe/activation_test.py:117: fn() -> torch.compile'd silu_mul_quant via torch/_dynamo/eval_frame.py:678 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> _fbgemm_silu_mul_quant[grid], same CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  fails identically at moe/activation_test.py:126: same CompilationError via _kernel_quantize_fp8_row[grid]
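The error text itself names the escape hatch: on this architecture Triton can still compile 'fp8e4b15' and 'fp8e5', and 'fp8e5' corresponds to torch.float8_e5m2. A hedged sketch of a dtype fallback (pick_fp8_dtype is hypothetical, not an FBGEMM API):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # Prefer e4m3 (Triton 'fp8e4nv') where the hardware supports it;
    # otherwise fall back to e5m2 (Triton 'fp8e5'), which sm_86 can compile.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2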
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  fails identically at moe/activation_test.py:117: fn() -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid], same CompilationError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  fails identically at moe/activation_test.py:117: fn() -> torch.compile'd silu_mul_quant via torch/_dynamo/eval_frame.py:678 -> activation.py:80 -> _fbgemm_silu_mul_quant[grid], same CompilationError
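Rather than letting Hypothesis replay every example into the same compiler error, the whole test could be skipped up front on parts that cannot compile fp8e4nv. A hypothetical guard, not present in moe/activation_test.py (the helper and test names below are illustrative):

import pytest
import torch

def _cuda_supports_fp8e4nv() -> bool:
    # Short-circuits when CUDA is absent so collection stays safe on CPU hosts.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@pytest.mark.skipif(
    not _cuda_supports_fp8e4nv(),
    reason="Triton fp8e4nv (float8_e4m3fn) requires sm_89 or newer",
)
def test_silu_mul_quant_guarded() -> None:
    ...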
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  fails identically at moe/activation_test.py:117: fn() -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant[grid]:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.6612591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:28.6612692Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[test body and traceback identical to the previous example: the _fbgemm_silu_mul_quant launch at moe/activation_test.py:117 fails with triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
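Every failing example hits the same root cause: Triton refuses to lower the fp8e4nv element type (PyTorch's float8_e4m3fn) on this GPU architecture. A minimal sketch of a capability probe that would predict this, assuming — as the error text suggests — that fp8e4nv needs a newer compute capability (commonly taken to be SM 8.9+; treat the exact threshold as an assumption, not Triton's documented contract):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) lowering is only available on newer NVIDIA
        # parts; older SMs such as 8.0/8.6 only expose fp8e4b15 and fp8e5,
        # which is exactly what the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold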
2025-05-07T20:33:28.6625268Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6637549Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body; with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6650728Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
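Since both compiled and eager examples die in the same place, the usual fix is to gate the whole test class on hardware support rather than per-example. A minimal sketch, assuming the capability probe above (the class name here is hypothetical; the actual test class in moe/activation_test.py may differ):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # assumed SM 8.9 threshold, as above
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "Triton lacks fp8e4nv on this GPU")
    class ActivationTests(unittest.TestCase):  # hypothetical name
        ...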
2025-05-07T20:33:28.6663375Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[same test body; this example got past the kernel under test and failed in the reference path instead]
2025-05-07T20:33:28.6668628Z         y_fp8, y_scale = fn()
2025-05-07T20:33:28.6668744Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:28.6668947Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:28.6669043Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:28.6669137Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:28.6669258Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:28.6669395Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:28.6669564Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:28.6669661Z moe/activation_test.py:126:
2025-05-07T20:33:28.6669784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.6669882Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:28.6670051Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:28.6670604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:28.6670698Z     _kernel_quantize_fp8_row[grid](
[Triton autotuner and JIT frames as in the previous tracebacks: autotuner.py:186 run -> autotuner.py:166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir]
2025-05-07T20:33:28.6677709Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:28.6677808Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:28.6677879Z E   ^
2025-05-07T20:33:28.6678229Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.6678644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
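This example shows the reference path: ref_fn computes silu(x0) * x1 in fp32 and then row-wise fp8 quantization via triton_quantize_fp8_row, which compiles its own Triton kernel (_kernel_quantize_fp8_row) and fails the same way. For intuition, a plain-PyTorch sketch of row-wise quantization consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); this is a simulation under assumed semantics, not FBGEMM's implementation:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1)                   # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # clamp rows by the upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max       # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale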
2025-05-07T20:33:28.6678749Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6691515Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
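Note that compiled=True and compiled=False examples fail identically: torch.compile only adds the torch/_dynamo/eval_frame.py frame above the launch, while the CompilationError itself is raised when the Triton JIT first builds the kernel at launch time (jit.py:623 run -> compiler.py:273 compile), so both paths reach the same make_ir failure.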
2025-05-07T20:33:28.6707826Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6721390Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
2025-05-07T20:33:28.6734160Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[same test body; same CompilationError from _fbgemm_silu_mul_quant]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6745793Z 2025-05-07T20:33:28.6746196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6746244Z 2025-05-07T20:33:28.6746347Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6746563Z self=, 2025-05-07T20:33:28.6746632Z T=4096, 2025-05-07T20:33:28.6746707Z D=7168, 2025-05-07T20:33:28.6746786Z scale_ub=1200.0, 2025-05-07T20:33:28.6746867Z contiguous=False, 2025-05-07T20:33:28.6746946Z compiled=False, 2025-05-07T20:33:28.6747012Z ) 2025-05-07T20:33:28.6747220Z self = 2025-05-07T20:33:28.6747399Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6747403Z 2025-05-07T20:33:28.6747478Z @given( 2025-05-07T20:33:28.6747604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6747699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6747807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6747926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6748033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6748102Z ) 2025-05-07T20:33:28.6748349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6748435Z def test_silu_mul_quant( 2025-05-07T20:33:28.6748502Z self, 2025-05-07T20:33:28.6748577Z T: int, 2025-05-07T20:33:28.6748645Z D: int, 2025-05-07T20:33:28.6748739Z scale_ub: Optional[float], 2025-05-07T20:33:28.6748826Z contiguous: bool, 2025-05-07T20:33:28.6748905Z compiled: bool, 2025-05-07T20:33:28.6748977Z ) -> None: 2025-05-07T20:33:28.6749112Z torch.manual_seed(2025) 2025-05-07T20:33:28.6749190Z 2025-05-07T20:33:28.6749384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6749471Z 2025-05-07T20:33:28.6749558Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6749680Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6749766Z x = x_sign * x_clamp 2025-05-07T20:33:28.6749838Z x0 = x[:, :D] 2025-05-07T20:33:28.6749915Z x1 = x[:, D:] 2025-05-07T20:33:28.6749984Z 2025-05-07T20:33:28.6750061Z if contiguous: 2025-05-07T20:33:28.6750154Z x0 = x0.contiguous() 2025-05-07T20:33:28.6750235Z x1 = x1.contiguous() 2025-05-07T20:33:28.6750305Z 2025-05-07T20:33:28.6750392Z if scale_ub is not None: 2025-05-07T20:33:28.6750491Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6750664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6750732Z ) 2025-05-07T20:33:28.6750807Z else: 2025-05-07T20:33:28.6750902Z scale_ub_tensor = None 2025-05-07T20:33:28.6751007Z 2025-05-07T20:33:28.6751133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6751220Z op = silu_mul_quant 2025-05-07T20:33:28.6751303Z if compiled: 2025-05-07T20:33:28.6751394Z op = torch.compile(op) 2025-05-07T20:33:28.6751502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6751569Z 2025-05-07T20:33:28.6751659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6751663Z 2025-05-07T20:33:28.6751755Z moe/activation_test.py:117: 2025-05-07T20:33:28.6751877Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6751977Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6752075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6752566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6752661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6753013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6753227Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6753608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6753694Z kernel = self.compile( 2025-05-07T20:33:28.6754069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6754243Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6754362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6754370Z 2025-05-07T20:33:28.6754574Z self = 2025-05-07T20:33:28.6755340Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6755846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d42480>} 2025-05-07T20:33:28.6756578Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6756760Z context = 2025-05-07T20:33:28.6756765Z 2025-05-07T20:33:28.6756929Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6757226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6757334Z module_map=module_map) 2025-05-07T20:33:28.6757492Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6757584Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6757661Z E ^ 2025-05-07T20:33:28.6758004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6758009Z 2025-05-07T20:33:28.6758416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6758425Z 2025-05-07T20:33:28.6758520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6758736Z self=, 2025-05-07T20:33:28.6758857Z T=16384, 2025-05-07T20:33:28.6758928Z D=7168, 2025-05-07T20:33:28.6759005Z scale_ub=None, 2025-05-07T20:33:28.6759095Z contiguous=True, 2025-05-07T20:33:28.6759171Z compiled=True, 2025-05-07T20:33:28.6759239Z ) 2025-05-07T20:33:28.6759544Z self = 2025-05-07T20:33:28.6759713Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.6759722Z 2025-05-07T20:33:28.6759795Z @given( 2025-05-07T20:33:28.6759905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6759997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6760110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6760225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6760334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6760403Z ) 2025-05-07T20:33:28.6760640Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6760731Z def test_silu_mul_quant( 2025-05-07T20:33:28.6760807Z self, 2025-05-07T20:33:28.6760886Z T: int, 2025-05-07T20:33:28.6760955Z D: int, 2025-05-07T20:33:28.6761054Z scale_ub: Optional[float], 2025-05-07T20:33:28.6761137Z contiguous: bool, 2025-05-07T20:33:28.6761225Z compiled: bool, 2025-05-07T20:33:28.6761342Z ) -> None: 2025-05-07T20:33:28.6761428Z torch.manual_seed(2025) 2025-05-07T20:33:28.6761495Z 2025-05-07T20:33:28.6761660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6761730Z 2025-05-07T20:33:28.6761820Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6761942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6762022Z x = x_sign * x_clamp 2025-05-07T20:33:28.6762101Z x0 = x[:, :D] 2025-05-07T20:33:28.6762172Z x1 = x[:, D:] 2025-05-07T20:33:28.6762239Z 2025-05-07T20:33:28.6762323Z if contiguous: 2025-05-07T20:33:28.6762408Z x0 = x0.contiguous() 2025-05-07T20:33:28.6762491Z x1 = x1.contiguous() 2025-05-07T20:33:28.6762563Z 2025-05-07T20:33:28.6762650Z if scale_ub is not None: 2025-05-07T20:33:28.6762754Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6762881Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6762951Z ) 2025-05-07T20:33:28.6763021Z else: 2025-05-07T20:33:28.6763108Z scale_ub_tensor = None 2025-05-07T20:33:28.6763180Z 2025-05-07T20:33:28.6763308Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6763391Z op = silu_mul_quant 2025-05-07T20:33:28.6763468Z if compiled: 2025-05-07T20:33:28.6763563Z op = torch.compile(op) 2025-05-07T20:33:28.6763661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6763728Z 2025-05-07T20:33:28.6763813Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6763819Z 2025-05-07T20:33:28.6763908Z moe/activation_test.py:117: 2025-05-07T20:33:28.6764077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6764174Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6764348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6764712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6764800Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6765282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6765374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6765720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6765937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6766312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6766401Z kernel = self.compile( 2025-05-07T20:33:28.6766815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6766984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6767115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6767119Z 2025-05-07T20:33:28.6767320Z self = 2025-05-07T20:33:28.6768087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6768592Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8c48d43740>} 2025-05-07T20:33:28.6769329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6769515Z context = 2025-05-07T20:33:28.6769585Z 2025-05-07T20:33:28.6769743Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6770002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6770109Z module_map=module_map) 2025-05-07T20:33:28.6770267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6770361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6770434Z E ^ 2025-05-07T20:33:28.6770784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6770791Z 2025-05-07T20:33:28.6771203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6771207Z 2025-05-07T20:33:28.6771305Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6771527Z self=, 2025-05-07T20:33:28.6771598Z T=4096, 2025-05-07T20:33:28.6771667Z D=5120, 2025-05-07T20:33:28.6771742Z scale_ub=None, 2025-05-07T20:33:28.6771827Z contiguous=False, 2025-05-07T20:33:28.6771905Z compiled=True, 2025-05-07T20:33:28.6771971Z ) 2025-05-07T20:33:28.6772183Z self = 2025-05-07T20:33:28.6772349Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6772353Z 2025-05-07T20:33:28.6772433Z @given( 2025-05-07T20:33:28.6772544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6772686Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6772799Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6772910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6773018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6773086Z ) 2025-05-07T20:33:28.6773323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6773410Z def test_silu_mul_quant( 2025-05-07T20:33:28.6773480Z self, 2025-05-07T20:33:28.6773549Z T: int, 2025-05-07T20:33:28.6773624Z D: int, 2025-05-07T20:33:28.6773717Z scale_ub: Optional[float], 2025-05-07T20:33:28.6773799Z contiguous: bool, 2025-05-07T20:33:28.6773884Z compiled: bool, 2025-05-07T20:33:28.6773957Z ) -> None: 2025-05-07T20:33:28.6774051Z torch.manual_seed(2025) 2025-05-07T20:33:28.6774162Z 2025-05-07T20:33:28.6774327Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6774399Z 2025-05-07T20:33:28.6774523Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6774641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6774727Z x = x_sign * x_clamp 2025-05-07T20:33:28.6774806Z x0 = x[:, :D] 2025-05-07T20:33:28.6774878Z x1 = x[:, D:] 2025-05-07T20:33:28.6774949Z 2025-05-07T20:33:28.6775025Z if contiguous: 2025-05-07T20:33:28.6775112Z x0 = x0.contiguous() 2025-05-07T20:33:28.6775195Z x1 = x1.contiguous() 2025-05-07T20:33:28.6775261Z 2025-05-07T20:33:28.6775344Z if scale_ub is not None: 2025-05-07T20:33:28.6775444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6775571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6775642Z ) 2025-05-07T20:33:28.6775715Z else: 2025-05-07T20:33:28.6775803Z scale_ub_tensor = None 2025-05-07T20:33:28.6775874Z 2025-05-07T20:33:28.6775998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6776085Z op = silu_mul_quant 2025-05-07T20:33:28.6776170Z if compiled: 2025-05-07T20:33:28.6776266Z op = torch.compile(op) 2025-05-07T20:33:28.6776411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6776483Z 2025-05-07T20:33:28.6776566Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6776571Z 2025-05-07T20:33:28.6776662Z moe/activation_test.py:117: 2025-05-07T20:33:28.6776784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6776879Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6776977Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6777339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6777429Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6777917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6778014Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6778374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6778594Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6778929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6779020Z kernel = self.compile( 2025-05-07T20:33:28.6779420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6779612Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6779738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6779743Z 2025-05-07T20:33:28.6779985Z self = 2025-05-07T20:33:28.6780760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6781258Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92998c20>} 2025-05-07T20:33:28.6781993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6782176Z context = 2025-05-07T20:33:28.6782229Z 2025-05-07T20:33:28.6782391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6782693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6782794Z module_map=module_map) 2025-05-07T20:33:28.6782956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6783057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6783127Z E ^ 2025-05-07T20:33:28.6783475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6783479Z 2025-05-07T20:33:28.6783883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6783887Z 2025-05-07T20:33:28.6783984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6784206Z self=, 2025-05-07T20:33:28.6784278Z T=4096, 2025-05-07T20:33:28.6784354Z D=5120, 2025-05-07T20:33:28.6784443Z scale_ub=1200.0, 2025-05-07T20:33:28.6784524Z contiguous=False, 2025-05-07T20:33:28.6784606Z compiled=False, 2025-05-07T20:33:28.6784680Z ) 2025-05-07T20:33:28.6784891Z self = 2025-05-07T20:33:28.6785117Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6785121Z 2025-05-07T20:33:28.6785191Z @given( 2025-05-07T20:33:28.6785301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6785402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6785509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6785623Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6785733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6785803Z ) 2025-05-07T20:33:28.6786048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6786136Z def test_silu_mul_quant( 2025-05-07T20:33:28.6786205Z self, 2025-05-07T20:33:28.6786289Z T: int, 2025-05-07T20:33:28.6786359Z D: int, 2025-05-07T20:33:28.6786449Z scale_ub: Optional[float], 2025-05-07T20:33:28.6786536Z contiguous: bool, 2025-05-07T20:33:28.6786624Z compiled: bool, 2025-05-07T20:33:28.6786699Z ) -> None: 2025-05-07T20:33:28.6786792Z torch.manual_seed(2025) 2025-05-07T20:33:28.6786861Z 2025-05-07T20:33:28.6787027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6787095Z 2025-05-07T20:33:28.6787181Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6787303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6787386Z x = x_sign * x_clamp 2025-05-07T20:33:28.6787461Z x0 = x[:, :D] 2025-05-07T20:33:28.6787544Z x1 = x[:, D:] 2025-05-07T20:33:28.6787610Z 2025-05-07T20:33:28.6787686Z if contiguous: 2025-05-07T20:33:28.6787828Z x0 = x0.contiguous() 2025-05-07T20:33:28.6787912Z x1 = x1.contiguous() 2025-05-07T20:33:28.6787978Z 2025-05-07T20:33:28.6788073Z if scale_ub is not None: 2025-05-07T20:33:28.6788173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6788305Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6788384Z ) 2025-05-07T20:33:28.6788454Z else: 2025-05-07T20:33:28.6788546Z scale_ub_tensor = None 2025-05-07T20:33:28.6788617Z 2025-05-07T20:33:28.6788738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6788827Z op = silu_mul_quant 2025-05-07T20:33:28.6788905Z if compiled: 2025-05-07T20:33:28.6788997Z op = torch.compile(op) 2025-05-07T20:33:28.6789099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6789211Z 2025-05-07T20:33:28.6789293Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6789298Z 2025-05-07T20:33:28.6789400Z moe/activation_test.py:117: 2025-05-07T20:33:28.6789559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6789662Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6789757Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6790297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6790389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6790737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6790952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6791290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6791378Z kernel = self.compile( 2025-05-07T20:33:28.6791756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6791928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6792051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6792101Z 2025-05-07T20:33:28.6792301Z self = 2025-05-07T20:33:28.6793069Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6793566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b929996c0>} 2025-05-07T20:33:28.6794305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6794492Z context = 2025-05-07T20:33:28.6794499Z 2025-05-07T20:33:28.6794655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6794912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6795017Z module_map=module_map) 2025-05-07T20:33:28.6795172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6795262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6795338Z E ^ 2025-05-07T20:33:28.6795683Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6795689Z 2025-05-07T20:33:28.6796136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6796141Z 2025-05-07T20:33:28.6796239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6796454Z self=, 2025-05-07T20:33:28.6796530Z T=4096, 2025-05-07T20:33:28.6796601Z D=5120, 2025-05-07T20:33:28.6796676Z scale_ub=1200.0, 2025-05-07T20:33:28.6796764Z contiguous=False, 2025-05-07T20:33:28.6796843Z compiled=True, 2025-05-07T20:33:28.6796907Z ) 2025-05-07T20:33:28.6797130Z self = 2025-05-07T20:33:28.6797304Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.6797308Z 2025-05-07T20:33:28.6797378Z @given( 2025-05-07T20:33:28.6797492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6797630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6797743Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6797856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6798027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6798104Z ) 2025-05-07T20:33:28.6798341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6798437Z def test_silu_mul_quant( 2025-05-07T20:33:28.6798510Z self, 2025-05-07T20:33:28.6798579Z T: int, 2025-05-07T20:33:28.6798656Z D: int, 2025-05-07T20:33:28.6798746Z scale_ub: Optional[float], 2025-05-07T20:33:28.6798826Z contiguous: bool, 2025-05-07T20:33:28.6798909Z compiled: bool, 2025-05-07T20:33:28.6798979Z ) -> None: 2025-05-07T20:33:28.6799070Z torch.manual_seed(2025) 2025-05-07T20:33:28.6799137Z 2025-05-07T20:33:28.6799301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6799370Z 2025-05-07T20:33:28.6799459Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6799578Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6799663Z x = x_sign * x_clamp 2025-05-07T20:33:28.6799743Z x0 = x[:, :D] 2025-05-07T20:33:28.6799819Z x1 = x[:, D:] 2025-05-07T20:33:28.6799895Z 2025-05-07T20:33:28.6800017Z if contiguous: 2025-05-07T20:33:28.6800103Z x0 = x0.contiguous() 2025-05-07T20:33:28.6800192Z x1 = x1.contiguous() 2025-05-07T20:33:28.6800258Z 2025-05-07T20:33:28.6800341Z if scale_ub is not None: 2025-05-07T20:33:28.6800449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6800579Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6800649Z ) 2025-05-07T20:33:28.6800724Z else: 2025-05-07T20:33:28.6800816Z scale_ub_tensor = None 2025-05-07T20:33:28.6800883Z 2025-05-07T20:33:28.6801012Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6801097Z op = silu_mul_quant 2025-05-07T20:33:28.6801180Z if compiled: 2025-05-07T20:33:28.6801274Z op = torch.compile(op) 2025-05-07T20:33:28.6801375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6801442Z 2025-05-07T20:33:28.6801526Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6801533Z 2025-05-07T20:33:28.6801624Z moe/activation_test.py:117: 2025-05-07T20:33:28.6801752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6801845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6801937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6802304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6802390Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6802880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6803018Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6803370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6803589Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6803921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6804011Z kernel = self.compile( 2025-05-07T20:33:28.6804499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6804668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6804794Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6804798Z 2025-05-07T20:33:28.6805111Z self = 2025-05-07T20:33:28.6805926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6806423Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299afc0>} 2025-05-07T20:33:28.6807162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6807353Z context = 2025-05-07T20:33:28.6807357Z 2025-05-07T20:33:28.6807514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6807779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6807883Z module_map=module_map) 2025-05-07T20:33:28.6808039Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6808131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6808206Z E ^ 2025-05-07T20:33:28.6808873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6808880Z 2025-05-07T20:33:28.6809295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6809299Z 2025-05-07T20:33:28.6809396Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6809618Z self=, 2025-05-07T20:33:28.6809694Z T=2048, 2025-05-07T20:33:28.6809766Z D=7168, 2025-05-07T20:33:28.6809850Z scale_ub=1200.0, 2025-05-07T20:33:28.6809932Z contiguous=False, 2025-05-07T20:33:28.6810013Z compiled=False, 2025-05-07T20:33:28.6810083Z ) 2025-05-07T20:33:28.6810297Z self = 2025-05-07T20:33:28.6810465Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.6810476Z 2025-05-07T20:33:28.6810546Z @given( 2025-05-07T20:33:28.6810657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6810757Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6810865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6810975Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6811084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6811153Z ) 2025-05-07T20:33:28.6811390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6811480Z def test_silu_mul_quant( 2025-05-07T20:33:28.6814500Z self, 2025-05-07T20:33:28.6814589Z T: int, 2025-05-07T20:33:28.6814775Z D: int, 2025-05-07T20:33:28.6814879Z scale_ub: Optional[float], 2025-05-07T20:33:28.6814972Z contiguous: bool, 2025-05-07T20:33:28.6815056Z compiled: bool, 2025-05-07T20:33:28.6815139Z ) -> None: 2025-05-07T20:33:28.6815235Z torch.manual_seed(2025) 2025-05-07T20:33:28.6815314Z 2025-05-07T20:33:28.6815504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6815578Z 2025-05-07T20:33:28.6815679Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6815824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6815926Z x = x_sign * x_clamp 2025-05-07T20:33:28.6816027Z x0 = x[:, :D] 2025-05-07T20:33:28.6816114Z x1 = x[:, D:] 2025-05-07T20:33:28.6816191Z 2025-05-07T20:33:28.6816286Z if contiguous: 2025-05-07T20:33:28.6816439Z x0 = x0.contiguous() 2025-05-07T20:33:28.6816526Z x1 = x1.contiguous() 2025-05-07T20:33:28.6816603Z 2025-05-07T20:33:28.6816692Z if scale_ub is not None: 2025-05-07T20:33:28.6816856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6816999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6817073Z ) 2025-05-07T20:33:28.6817153Z else: 2025-05-07T20:33:28.6817255Z scale_ub_tensor = None 2025-05-07T20:33:28.6817326Z 2025-05-07T20:33:28.6817458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6817547Z op = silu_mul_quant 2025-05-07T20:33:28.6817628Z if compiled: 2025-05-07T20:33:28.6817726Z op = torch.compile(op) 2025-05-07T20:33:28.6817827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6817897Z 2025-05-07T20:33:28.6817994Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6817999Z 2025-05-07T20:33:28.6818099Z moe/activation_test.py:117: 2025-05-07T20:33:28.6818230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6818333Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6818432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6818935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.6819100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6819454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6819680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6820019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6820111Z kernel = self.compile( 2025-05-07T20:33:28.6820662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6820869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6820997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6821002Z 2025-05-07T20:33:28.6821204Z self = 2025-05-07T20:33:28.6821988Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6822545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9299bec0>} 2025-05-07T20:33:28.6823285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6823538Z context = 2025-05-07T20:33:28.6823545Z 2025-05-07T20:33:28.6823706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6823971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6824084Z module_map=module_map) 2025-05-07T20:33:28.6824242Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6824342Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6824417Z E ^ 2025-05-07T20:33:28.6824771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6824776Z 2025-05-07T20:33:28.6825189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6825235Z 2025-05-07T20:33:28.6825337Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6825596Z self=, 2025-05-07T20:33:28.6825670Z T=1, 2025-05-07T20:33:28.6825747Z D=7168, 2025-05-07T20:33:28.6825834Z scale_ub=None, 2025-05-07T20:33:28.6825919Z contiguous=True, 2025-05-07T20:33:28.6826005Z compiled=False, 2025-05-07T20:33:28.6826081Z ) 2025-05-07T20:33:28.6826296Z self = 2025-05-07T20:33:28.6826459Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.6826466Z 2025-05-07T20:33:28.6826538Z @given( 2025-05-07T20:33:28.6826652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6826750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6826862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6826981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6827097Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6827168Z ) 2025-05-07T20:33:28.6827414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6827508Z def test_silu_mul_quant( 2025-05-07T20:33:28.6827582Z self, 2025-05-07T20:33:28.6827701Z T: int, 2025-05-07T20:33:28.6827779Z D: int, 2025-05-07T20:33:28.6827880Z scale_ub: Optional[float], 2025-05-07T20:33:28.6827970Z contiguous: bool, 2025-05-07T20:33:28.6828054Z compiled: bool, 2025-05-07T20:33:28.6828129Z ) -> None: 2025-05-07T20:33:28.6828222Z torch.manual_seed(2025) 2025-05-07T20:33:28.6828291Z 2025-05-07T20:33:28.6828462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6828536Z 2025-05-07T20:33:28.6828623Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6828747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6828835Z x = x_sign * x_clamp 2025-05-07T20:33:28.6828916Z x0 = x[:, :D] 2025-05-07T20:33:28.6828994Z x1 = x[:, D:] 2025-05-07T20:33:28.6829073Z 2025-05-07T20:33:28.6829165Z if contiguous: 2025-05-07T20:33:28.6829263Z x0 = x0.contiguous() 2025-05-07T20:33:28.6829352Z x1 = x1.contiguous() 2025-05-07T20:33:28.6829427Z 2025-05-07T20:33:28.6829517Z if scale_ub is not None: 2025-05-07T20:33:28.6829621Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6829753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6829828Z ) 2025-05-07T20:33:28.6829902Z else: 2025-05-07T20:33:28.6829990Z scale_ub_tensor = None 2025-05-07T20:33:28.6830063Z 2025-05-07T20:33:28.6830188Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6830274Z op = silu_mul_quant 2025-05-07T20:33:28.6830363Z if compiled: 2025-05-07T20:33:28.6830458Z op = torch.compile(op) 2025-05-07T20:33:28.6830610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6830682Z 2025-05-07T20:33:28.6830774Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6830779Z 2025-05-07T20:33:28.6830875Z moe/activation_test.py:117: 2025-05-07T20:33:28.6831003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6831100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6831200Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6831692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6831784Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6832140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6832429Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6832774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6832905Z kernel = self.compile( 2025-05-07T20:33:28.6833285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6833464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6833587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6833591Z 2025-05-07T20:33:28.6833793Z self = 2025-05-07T20:33:28.6834571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6835075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ccc0>} 2025-05-07T20:33:28.6835819Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6836051Z context = 2025-05-07T20:33:28.6836056Z 2025-05-07T20:33:28.6836217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6836480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6836585Z module_map=module_map) 2025-05-07T20:33:28.6836748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6836842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6836919Z E ^ 2025-05-07T20:33:28.6837273Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6837281Z 2025-05-07T20:33:28.6837691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6837698Z 2025-05-07T20:33:28.6837802Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6838020Z self=, 2025-05-07T20:33:28.6838095Z T=16384, 2025-05-07T20:33:28.6838174Z D=7168, 2025-05-07T20:33:28.6838254Z scale_ub=1200.0, 2025-05-07T20:33:28.6838337Z contiguous=False, 2025-05-07T20:33:28.6838426Z compiled=True, 2025-05-07T20:33:28.6838497Z ) 2025-05-07T20:33:28.6838711Z self = 2025-05-07T20:33:28.6838889Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.6838896Z 2025-05-07T20:33:28.6838969Z @given( 2025-05-07T20:33:28.6839140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6839239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6839351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6839468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6839579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6839652Z ) 2025-05-07T20:33:28.6839899Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6839990Z def test_silu_mul_quant( 2025-05-07T20:33:28.6840063Z self, 2025-05-07T20:33:28.6840139Z T: int, 2025-05-07T20:33:28.6840214Z D: int, 2025-05-07T20:33:28.6840310Z scale_ub: Optional[float], 2025-05-07T20:33:28.6840404Z contiguous: bool, 2025-05-07T20:33:28.6840509Z compiled: bool, 2025-05-07T20:33:28.6840645Z ) -> None: 2025-05-07T20:33:28.6840748Z torch.manual_seed(2025) 2025-05-07T20:33:28.6840820Z 2025-05-07T20:33:28.6841025Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6841096Z 2025-05-07T20:33:28.6841187Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6841312Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6841400Z x = x_sign * x_clamp 2025-05-07T20:33:28.6841478Z x0 = x[:, :D] 2025-05-07T20:33:28.6841557Z x1 = x[:, D:] 2025-05-07T20:33:28.6841625Z 2025-05-07T20:33:28.6841705Z if contiguous: 2025-05-07T20:33:28.6841797Z x0 = x0.contiguous() 2025-05-07T20:33:28.6841882Z x1 = x1.contiguous() 2025-05-07T20:33:28.6841954Z 2025-05-07T20:33:28.6842040Z if scale_ub is not None: 2025-05-07T20:33:28.6842141Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6842276Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6842355Z ) 2025-05-07T20:33:28.6842429Z else: 2025-05-07T20:33:28.6842524Z scale_ub_tensor = None 2025-05-07T20:33:28.6842593Z 2025-05-07T20:33:28.6842722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6842809Z op = silu_mul_quant 2025-05-07T20:33:28.6842890Z if compiled: 2025-05-07T20:33:28.6843031Z op = torch.compile(op) 2025-05-07T20:33:28.6843140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6843209Z 2025-05-07T20:33:28.6843295Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6843303Z 2025-05-07T20:33:28.6843397Z moe/activation_test.py:117: 2025-05-07T20:33:28.6843521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6843624Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6843722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6844084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6844182Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6844811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6844906Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6845261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6845479Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6845816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6845910Z kernel = self.compile( 2025-05-07T20:33:28.6846288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6846461Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6846634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6846639Z 2025-05-07T20:33:28.6846845Z self = 2025-05-07T20:33:28.6847623Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6848122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379e0c0>} 2025-05-07T20:33:28.6848863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6849093Z context = 2025-05-07T20:33:28.6849098Z 2025-05-07T20:33:28.6849266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6849560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6849669Z module_map=module_map) 2025-05-07T20:33:28.6849846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6849944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6850024Z E ^ 2025-05-07T20:33:28.6850375Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6850380Z 2025-05-07T20:33:28.6850788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6850792Z 2025-05-07T20:33:28.6850897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6851120Z self=, 2025-05-07T20:33:28.6851198Z T=1, 2025-05-07T20:33:28.6851270Z D=7168, 2025-05-07T20:33:28.6851351Z scale_ub=None, 2025-05-07T20:33:28.6851441Z contiguous=False, 2025-05-07T20:33:28.6851523Z compiled=False, 2025-05-07T20:33:28.6851592Z ) 2025-05-07T20:33:28.6851853Z self = 2025-05-07T20:33:28.6852021Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.6852026Z 2025-05-07T20:33:28.6852098Z @given( 2025-05-07T20:33:28.6852215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6852310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6852426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6852538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6852648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6852722Z ) 2025-05-07T20:33:28.6852965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6853054Z def test_silu_mul_quant( 2025-05-07T20:33:28.6853134Z self, 2025-05-07T20:33:28.6853208Z T: int, 2025-05-07T20:33:28.6853281Z D: int, 2025-05-07T20:33:28.6853378Z scale_ub: Optional[float], 2025-05-07T20:33:28.6853473Z contiguous: bool, 2025-05-07T20:33:28.6853554Z compiled: bool, 2025-05-07T20:33:28.6853631Z ) -> None: 2025-05-07T20:33:28.6853721Z torch.manual_seed(2025) 2025-05-07T20:33:28.6853795Z 2025-05-07T20:33:28.6853963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6854034Z 2025-05-07T20:33:28.6854126Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6854245Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6854332Z x = x_sign * x_clamp 2025-05-07T20:33:28.6854413Z x0 = x[:, :D] 2025-05-07T20:33:28.6854489Z x1 = x[:, D:] 2025-05-07T20:33:28.6854558Z 2025-05-07T20:33:28.6854688Z if contiguous: 2025-05-07T20:33:28.6854783Z x0 = x0.contiguous() 2025-05-07T20:33:28.6854870Z x1 = x1.contiguous() 2025-05-07T20:33:28.6854942Z 2025-05-07T20:33:28.6855030Z if scale_ub is not None: 2025-05-07T20:33:28.6855136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6855272Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6855345Z ) 2025-05-07T20:33:28.6855424Z else: 2025-05-07T20:33:28.6855515Z scale_ub_tensor = None 2025-05-07T20:33:28.6855584Z 2025-05-07T20:33:28.6855712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6855801Z op = silu_mul_quant 2025-05-07T20:33:28.6855882Z if compiled: 2025-05-07T20:33:28.6855981Z op = torch.compile(op) 2025-05-07T20:33:28.6856127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6856198Z 2025-05-07T20:33:28.6856290Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6856294Z 2025-05-07T20:33:28.6856425Z moe/activation_test.py:117: 2025-05-07T20:33:28.6856556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6856653Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6856755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6857250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6857343Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6857694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6857916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6858249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6858347Z kernel = self.compile( 2025-05-07T20:33:28.6858725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6858896Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6859065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6859069Z 2025-05-07T20:33:28.6859270Z self = 2025-05-07T20:33:28.6860050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6860551Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b9379ec00>} 2025-05-07T20:33:28.6861323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6861538Z context = 2025-05-07T20:33:28.6861546Z 2025-05-07T20:33:28.6861707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6861968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6862072Z module_map=module_map) 2025-05-07T20:33:28.6862229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6862328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6862401Z E ^ 2025-05-07T20:33:28.6862749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6862758Z 2025-05-07T20:33:28.6863236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.6863241Z 2025-05-07T20:33:28.6863341Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.6863564Z self=, 2025-05-07T20:33:28.6863638Z T=2048, 2025-05-07T20:33:28.6863713Z D=7168, 2025-05-07T20:33:28.6863795Z scale_ub=None, 2025-05-07T20:33:28.6863877Z contiguous=False, 2025-05-07T20:33:28.6863957Z compiled=True, 2025-05-07T20:33:28.6864028Z ) 2025-05-07T20:33:28.6864243Z self = 2025-05-07T20:33:28.6864416Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.6864420Z 2025-05-07T20:33:28.6864492Z @given( 2025-05-07T20:33:28.6864649Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.6864752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.6864869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.6865025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.6865141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.6865211Z ) 2025-05-07T20:33:28.6865457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.6865549Z def test_silu_mul_quant( 2025-05-07T20:33:28.6865621Z self, 2025-05-07T20:33:28.6865696Z T: int, 2025-05-07T20:33:28.6865769Z D: int, 2025-05-07T20:33:28.6865865Z scale_ub: Optional[float], 2025-05-07T20:33:28.6865952Z contiguous: bool, 2025-05-07T20:33:28.6866034Z compiled: bool, 2025-05-07T20:33:28.6866108Z ) -> None: 2025-05-07T20:33:28.6866205Z torch.manual_seed(2025) 2025-05-07T20:33:28.6866279Z 2025-05-07T20:33:28.6866447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.6866533Z 2025-05-07T20:33:28.6866621Z x_sign = torch.sign(x) 2025-05-07T20:33:28.6866746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.6866834Z x = x_sign * x_clamp 2025-05-07T20:33:28.6866910Z x0 = x[:, :D] 2025-05-07T20:33:28.6867037Z x1 = x[:, D:] 2025-05-07T20:33:28.6867105Z 2025-05-07T20:33:28.6867186Z if contiguous: 2025-05-07T20:33:28.6867277Z x0 = x0.contiguous() 2025-05-07T20:33:28.6867361Z x1 = x1.contiguous() 2025-05-07T20:33:28.6867430Z 2025-05-07T20:33:28.6867523Z if scale_ub is not None: 2025-05-07T20:33:28.6867624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.6867754Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.6867832Z ) 2025-05-07T20:33:28.6867905Z else: 2025-05-07T20:33:28.6867999Z scale_ub_tensor = None 2025-05-07T20:33:28.6868075Z 2025-05-07T20:33:28.6868207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.6868296Z op = silu_mul_quant 2025-05-07T20:33:28.6868380Z if compiled: 2025-05-07T20:33:28.6868475Z op = torch.compile(op) 2025-05-07T20:33:28.6868581Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6868659Z 2025-05-07T20:33:28.6868745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.6868750Z 2025-05-07T20:33:28.6868850Z moe/activation_test.py:117: 2025-05-07T20:33:28.6868975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6869073Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.6869178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.6869592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.6869685Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.6870216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.6870313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.6870669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.6870890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.6871224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.6871318Z kernel = self.compile( 2025-05-07T20:33:28.6871695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.6871873Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.6871996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.6872044Z 2025-05-07T20:33:28.6872250Z self = 2025-05-07T20:33:28.6873070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.6873575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92cbc2c0>} 2025-05-07T20:33:28.6874315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.6874506Z context = 2025-05-07T20:33:28.6874510Z 2025-05-07T20:33:28.6874674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.6874938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.6875044Z module_map=module_map) 2025-05-07T20:33:28.6875202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.6875299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.6875415Z E ^ 2025-05-07T20:33:28.6875767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.6875771Z 2025-05-07T20:33:28.6876182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[The remaining Hypothesis examples, condensed: each ran the identical test body and failed with the identical trace — through torch/_dynamo/eval_frame.py (when compiled=True), fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (_fbgemm_silu_mul_quant[grid]), triton/runtime/jit.py:330 and :623, and triton/compiler/compiler.py:273 — ending in the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  [trace truncated here at the end of this log section]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7020214Z 2025-05-07T20:33:28.7020628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7020635Z 2025-05-07T20:33:28.7020740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7020969Z self=, 2025-05-07T20:33:28.7021045Z T=4096, 2025-05-07T20:33:28.7021120Z D=7168, 2025-05-07T20:33:28.7021205Z scale_ub=1200.0, 2025-05-07T20:33:28.7021332Z contiguous=False, 2025-05-07T20:33:28.7021415Z compiled=True, 2025-05-07T20:33:28.7021493Z ) 2025-05-07T20:33:28.7021709Z self = 2025-05-07T20:33:28.7021881Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7021890Z 2025-05-07T20:33:28.7021964Z @given( 2025-05-07T20:33:28.7022081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7022182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7022299Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7022413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7022529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7022602Z ) 2025-05-07T20:33:28.7022849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7022944Z def test_silu_mul_quant( 2025-05-07T20:33:28.7023022Z self, 2025-05-07T20:33:28.7023099Z T: int, 2025-05-07T20:33:28.7023176Z D: int, 2025-05-07T20:33:28.7023273Z scale_ub: Optional[float], 2025-05-07T20:33:28.7023366Z contiguous: bool, 2025-05-07T20:33:28.7023451Z compiled: bool, 2025-05-07T20:33:28.7023528Z ) -> None: 2025-05-07T20:33:28.7023620Z torch.manual_seed(2025) 2025-05-07T20:33:28.7023689Z 2025-05-07T20:33:28.7023853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7023931Z 2025-05-07T20:33:28.7024020Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7024141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7024304Z x = x_sign * x_clamp 2025-05-07T20:33:28.7024413Z x0 = x[:, :D] 2025-05-07T20:33:28.7024520Z x1 = x[:, D:] 2025-05-07T20:33:28.7024614Z 2025-05-07T20:33:28.7024724Z if contiguous: 2025-05-07T20:33:28.7024847Z x0 = x0.contiguous() 2025-05-07T20:33:28.7024934Z x1 = x1.contiguous() 2025-05-07T20:33:28.7024999Z 2025-05-07T20:33:28.7025089Z if scale_ub is not None: 2025-05-07T20:33:28.7025189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7025318Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7025395Z ) 2025-05-07T20:33:28.7025465Z else: 2025-05-07T20:33:28.7025555Z scale_ub_tensor = None 2025-05-07T20:33:28.7025625Z 2025-05-07T20:33:28.7025748Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7025889Z op = silu_mul_quant 2025-05-07T20:33:28.7025974Z if compiled: 2025-05-07T20:33:28.7026071Z op = torch.compile(op) 2025-05-07T20:33:28.7026213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7026280Z 2025-05-07T20:33:28.7026366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7026371Z 2025-05-07T20:33:28.7026466Z moe/activation_test.py:117: 2025-05-07T20:33:28.7026592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7026688Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7026787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7027152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7027240Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7027732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7027829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7028187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7028407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7028739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7028873Z kernel = self.compile( 2025-05-07T20:33:28.7029254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7029428Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7029555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7029559Z 2025-05-07T20:33:28.7029781Z self = 2025-05-07T20:33:28.7030560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7031054Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92691300>} 2025-05-07T20:33:28.7031796Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7031982Z context = 2025-05-07T20:33:28.7031986Z 2025-05-07T20:33:28.7032146Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7032407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7032510Z module_map=module_map) 2025-05-07T20:33:28.7032708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7032802Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7032875Z E ^ 2025-05-07T20:33:28.7033224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7033231Z 2025-05-07T20:33:28.7033636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7033640Z 2025-05-07T20:33:28.7033740Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7033955Z self=, 2025-05-07T20:33:28.7034026Z T=128, 2025-05-07T20:33:28.7034099Z D=7168, 2025-05-07T20:33:28.7034171Z scale_ub=1200.0, 2025-05-07T20:33:28.7034250Z contiguous=False, 2025-05-07T20:33:28.7034370Z compiled=True, 2025-05-07T20:33:28.7034437Z ) 2025-05-07T20:33:28.7034652Z self = 2025-05-07T20:33:28.7034858Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7034863Z 2025-05-07T20:33:28.7034932Z @given( 2025-05-07T20:33:28.7035047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7035140Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7035249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7035360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7035466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7035532Z ) 2025-05-07T20:33:28.7035774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7035860Z def test_silu_mul_quant( 2025-05-07T20:33:28.7035929Z self, 2025-05-07T20:33:28.7036005Z T: int, 2025-05-07T20:33:28.7036074Z D: int, 2025-05-07T20:33:28.7036165Z scale_ub: Optional[float], 2025-05-07T20:33:28.7036253Z contiguous: bool, 2025-05-07T20:33:28.7036331Z compiled: bool, 2025-05-07T20:33:28.7036408Z ) -> None: 2025-05-07T20:33:28.7036496Z torch.manual_seed(2025) 2025-05-07T20:33:28.7036563Z 2025-05-07T20:33:28.7036727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7036840Z 2025-05-07T20:33:28.7036923Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7037044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7037126Z x = x_sign * x_clamp 2025-05-07T20:33:28.7037197Z x0 = x[:, :D] 2025-05-07T20:33:28.7037276Z x1 = x[:, D:] 2025-05-07T20:33:28.7037344Z 2025-05-07T20:33:28.7037420Z if contiguous: 2025-05-07T20:33:28.7037512Z x0 = x0.contiguous() 2025-05-07T20:33:28.7037593Z x1 = x1.contiguous() 2025-05-07T20:33:28.7037664Z 2025-05-07T20:33:28.7037751Z if scale_ub is not None: 2025-05-07T20:33:28.7037850Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7037984Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7038055Z ) 2025-05-07T20:33:28.7038127Z else: 2025-05-07T20:33:28.7038220Z scale_ub_tensor = None 2025-05-07T20:33:28.7038290Z 2025-05-07T20:33:28.7038414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7038500Z op = silu_mul_quant 2025-05-07T20:33:28.7038577Z if compiled: 2025-05-07T20:33:28.7038672Z op = torch.compile(op) 2025-05-07T20:33:28.7038773Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7038841Z 2025-05-07T20:33:28.7038925Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7038932Z 2025-05-07T20:33:28.7039026Z moe/activation_test.py:117: 2025-05-07T20:33:28.7039149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7039249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7039388Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7039755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7039845Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7040336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7040433Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7040781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7040997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7041334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7041471Z kernel = self.compile( 2025-05-07T20:33:28.7041847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7042083Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7042205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7042212Z 2025-05-07T20:33:28.7042411Z self = 2025-05-07T20:33:28.7043184Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7043681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92692160>} 2025-05-07T20:33:28.7044524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7044712Z context = 2025-05-07T20:33:28.7044717Z 2025-05-07T20:33:28.7044876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7045177Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7045279Z module_map=module_map) 2025-05-07T20:33:28.7045440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7045530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7045605Z E ^ 2025-05-07T20:33:28.7045951Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7045958Z 2025-05-07T20:33:28.7046368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7046372Z 2025-05-07T20:33:28.7046475Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7046693Z self=, 2025-05-07T20:33:28.7046766Z T=2048, 2025-05-07T20:33:28.7046839Z D=7168, 2025-05-07T20:33:28.7046913Z scale_ub=None, 2025-05-07T20:33:28.7046993Z contiguous=True, 2025-05-07T20:33:28.7047068Z compiled=True, 2025-05-07T20:33:28.7047134Z ) 2025-05-07T20:33:28.7047352Z self = 2025-05-07T20:33:28.7047518Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7047522Z 2025-05-07T20:33:28.7047591Z @given( 2025-05-07T20:33:28.7047706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7047798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7047911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7048066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7048176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7048249Z ) 2025-05-07T20:33:28.7048487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7048576Z def test_silu_mul_quant( 2025-05-07T20:33:28.7048650Z self, 2025-05-07T20:33:28.7048720Z T: int, 2025-05-07T20:33:28.7048789Z D: int, 2025-05-07T20:33:28.7048882Z scale_ub: Optional[float], 2025-05-07T20:33:28.7048966Z contiguous: bool, 2025-05-07T20:33:28.7049047Z compiled: bool, 2025-05-07T20:33:28.7049122Z ) -> None: 2025-05-07T20:33:28.7049210Z torch.manual_seed(2025) 2025-05-07T20:33:28.7049280Z 2025-05-07T20:33:28.7049441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7049552Z 2025-05-07T20:33:28.7049639Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7049759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7049841Z x = x_sign * x_clamp 2025-05-07T20:33:28.7049954Z x0 = x[:, :D] 2025-05-07T20:33:28.7050028Z x1 = x[:, D:] 2025-05-07T20:33:28.7050093Z 2025-05-07T20:33:28.7050174Z if contiguous: 2025-05-07T20:33:28.7050261Z x0 = x0.contiguous() 2025-05-07T20:33:28.7050342Z x1 = x1.contiguous() 2025-05-07T20:33:28.7050412Z 2025-05-07T20:33:28.7050494Z if scale_ub is not None: 2025-05-07T20:33:28.7050593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7050725Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7050794Z ) 2025-05-07T20:33:28.7050866Z else: 2025-05-07T20:33:28.7050957Z scale_ub_tensor = None 2025-05-07T20:33:28.7053845Z 2025-05-07T20:33:28.7053991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7054072Z op = silu_mul_quant 2025-05-07T20:33:28.7054153Z if compiled: 2025-05-07T20:33:28.7054244Z op = torch.compile(op) 2025-05-07T20:33:28.7054350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7054411Z 2025-05-07T20:33:28.7054490Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7054561Z 2025-05-07T20:33:28.7054651Z moe/activation_test.py:117: 2025-05-07T20:33:28.7054773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7054866Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7054956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7055320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7055405Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7055890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7055983Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7056336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7056548Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7056880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7056965Z kernel = self.compile( 2025-05-07T20:33:28.7057338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7057511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7057633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7057637Z 2025-05-07T20:33:28.7057832Z self = 2025-05-07T20:33:28.7058654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7059151Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92693420>} 2025-05-07T20:33:28.7059891Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7060074Z context = 2025-05-07T20:33:28.7060078Z 2025-05-07T20:33:28.7060236Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7060531Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7060632Z module_map=module_map) 2025-05-07T20:33:28.7060828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7060917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7060981Z E ^ 2025-05-07T20:33:28.7061332Z E ValueError("type fp8e4nv not supported in this architecture. 
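The CompilationError above is an architecture gate rather than a kernel bug: Triton's NVIDIA backend only lowers the fp8e4nv encoding (FP8 E4M3, the layout behind torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer, and the A10G in a linux.g5.4xlarge runner is SM 8.6, which is why only 'fp8e4b15' and 'fp8e5' are reported as supported. A minimal guard sketch for skipping such tests on older GPUs follows; the helper and class names are illustrative and not part of the FBGEMM test suite:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # fp8e4nv (E4M3) lowering requires an NVIDIA GPU with compute
    # capability >= (8, 9), i.e. Ada or Hopper.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
class Fp8ActivationTests(unittest.TestCase):
    ...

With such a guard in place, the run on this SM 8.6 runner would report skips instead of burning through Hypothesis examples that can never compile.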
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Further examples hit the same OutOfMemoryError during input setup, before any kernel ran:
    (T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  - 112.00 MiB at x_clamp (moe/activation_test.py:95)
    (T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) - 448.00 MiB at torch.randn (moe/activation_test.py:92)
    (T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)  - 56.00 MiB at x_clamp (moe/activation_test.py:95)
    (T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)   - 56.00 MiB at x_sign (moe/activation_test.py:94)
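The requested sizes line up exactly with one [T, 2 * D] bfloat16 tensor at 2 bytes per element, and each of torch.randn, torch.sign, torch.abs, and torch.clamp materializes another buffer of that size, so a single large example needs several such buffers alive at once. A quick check of the arithmetic (illustrative, not part of the test suite):

def full_tensor_mib(T: int, D: int) -> float:
    # Size of one [T, 2 * D] bfloat16 tensor in MiB (2 bytes per element).
    return T * 2 * D * 2 / 2**20


assert full_tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB request above
assert full_tensor_mib(4096, 7168) == 112.0   # the 112.00 MiB request at x_clamp
assert full_tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB request at torch.randn
assert full_tensor_mib(2048, 7168) == 56.0    # the 56.00 MiB requests at x_sign/x_clamp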
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7087409Z 2025-05-07T20:33:28.7087517Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:28.7087524Z 2025-05-07T20:33:28.7087617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7087830Z self=, 2025-05-07T20:33:28.7087898Z T=1, 2025-05-07T20:33:28.7087964Z D=7168, 2025-05-07T20:33:28.7088035Z scale_ub=1200.0, 2025-05-07T20:33:28.7088114Z contiguous=True, 2025-05-07T20:33:28.7088185Z compiled=False, 2025-05-07T20:33:28.7088249Z ) 2025-05-07T20:33:28.7088457Z self = 2025-05-07T20:33:28.7088612Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7088617Z 2025-05-07T20:33:28.7088681Z @given( 2025-05-07T20:33:28.7088790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7088877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7088980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7089088Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7089195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7089308Z ) 2025-05-07T20:33:28.7089547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7089628Z def test_silu_mul_quant( 2025-05-07T20:33:28.7089696Z self, 2025-05-07T20:33:28.7089761Z T: int, 2025-05-07T20:33:28.7089831Z D: int, 2025-05-07T20:33:28.7089921Z scale_ub: Optional[float], 2025-05-07T20:33:28.7089999Z contiguous: bool, 2025-05-07T20:33:28.7090074Z compiled: bool, 2025-05-07T20:33:28.7090146Z ) -> None: 2025-05-07T20:33:28.7090229Z torch.manual_seed(2025) 2025-05-07T20:33:28.7090289Z 2025-05-07T20:33:28.7090449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7090511Z 2025-05-07T20:33:28.7090595Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7090709Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7090833Z x = x_sign * x_clamp 2025-05-07T20:33:28.7090905Z x0 = x[:, :D] 2025-05-07T20:33:28.7090977Z x1 = x[:, D:] 2025-05-07T20:33:28.7091038Z 2025-05-07T20:33:28.7091153Z if contiguous: 2025-05-07T20:33:28.7091235Z x0 = x0.contiguous() 2025-05-07T20:33:28.7091314Z x1 = x1.contiguous() 2025-05-07T20:33:28.7091381Z 2025-05-07T20:33:28.7091463Z if scale_ub is not None: 2025-05-07T20:33:28.7091558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7091687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7091751Z ) 2025-05-07T20:33:28.7091815Z else: 2025-05-07T20:33:28.7091900Z scale_ub_tensor = None 2025-05-07T20:33:28.7091964Z 2025-05-07T20:33:28.7092107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7092194Z op = silu_mul_quant 2025-05-07T20:33:28.7092284Z if compiled: 2025-05-07T20:33:28.7092391Z op = torch.compile(op) 2025-05-07T20:33:28.7092489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7092550Z 2025-05-07T20:33:28.7092634Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7092638Z 2025-05-07T20:33:28.7092724Z moe/activation_test.py:117: 2025-05-07T20:33:28.7092845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7092983Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7093073Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7093568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7093655Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7094006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7094223Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7094561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7094646Z kernel = self.compile( 2025-05-07T20:33:28.7095028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7095194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7095317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7095321Z 2025-05-07T20:33:28.7095516Z self = 2025-05-07T20:33:28.7096288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7096831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92bca2a0>} 2025-05-07T20:33:28.7097570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7097758Z context = 2025-05-07T20:33:28.7097763Z 2025-05-07T20:33:28.7097918Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7098175Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7098273Z module_map=module_map) 2025-05-07T20:33:28.7098425Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7098518Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7098629Z E ^ 2025-05-07T20:33:28.7098976Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7098981Z 2025-05-07T20:33:28.7099474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7099479Z 2025-05-07T20:33:28.7099572Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7099789Z self=, 2025-05-07T20:33:28.7099855Z T=128, 2025-05-07T20:33:28.7099919Z D=5120, 2025-05-07T20:33:28.7099991Z scale_ub=None, 2025-05-07T20:33:28.7100065Z contiguous=True, 2025-05-07T20:33:28.7100136Z compiled=False, 2025-05-07T20:33:28.7100200Z ) 2025-05-07T20:33:28.7100408Z self = 2025-05-07T20:33:28.7100568Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7100580Z 2025-05-07T20:33:28.7100643Z @given( 2025-05-07T20:33:28.7100755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7100847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7100953Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7101059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7101165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7101270Z ) 2025-05-07T20:33:28.7101505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7101588Z def test_silu_mul_quant( 2025-05-07T20:33:28.7101653Z self, 2025-05-07T20:33:28.7101718Z T: int, 2025-05-07T20:33:28.7101787Z D: int, 2025-05-07T20:33:28.7101874Z scale_ub: Optional[float], 2025-05-07T20:33:28.7101955Z contiguous: bool, 2025-05-07T20:33:28.7102031Z compiled: bool, 2025-05-07T20:33:28.7102096Z ) -> None: 2025-05-07T20:33:28.7102183Z torch.manual_seed(2025) 2025-05-07T20:33:28.7102245Z 2025-05-07T20:33:28.7102408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7102473Z 2025-05-07T20:33:28.7102557Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7102672Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7102753Z x = x_sign * x_clamp 2025-05-07T20:33:28.7102827Z x0 = x[:, :D] 2025-05-07T20:33:28.7102895Z x1 = x[:, D:] 2025-05-07T20:33:28.7102959Z 2025-05-07T20:33:28.7103031Z if contiguous: 2025-05-07T20:33:28.7103114Z x0 = x0.contiguous() 2025-05-07T20:33:28.7103193Z x1 = x1.contiguous() 2025-05-07T20:33:28.7103253Z 2025-05-07T20:33:28.7103335Z if scale_ub is not None: 2025-05-07T20:33:28.7103430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7103556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7103624Z ) 2025-05-07T20:33:28.7103687Z else: 2025-05-07T20:33:28.7103771Z scale_ub_tensor = None 2025-05-07T20:33:28.7103905Z 2025-05-07T20:33:28.7104026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7104111Z op = silu_mul_quant 2025-05-07T20:33:28.7104189Z if compiled: 2025-05-07T20:33:28.7104277Z op = torch.compile(op) 2025-05-07T20:33:28.7104379Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7104440Z 2025-05-07T20:33:28.7104519Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7104524Z 2025-05-07T20:33:28.7104614Z moe/activation_test.py:117: 2025-05-07T20:33:28.7104734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7104823Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7104917Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7105404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7105532Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7105926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7106139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7106469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7106557Z kernel = self.compile( 2025-05-07T20:33:28.7106929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7107098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7107218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7107222Z 2025-05-07T20:33:28.7107419Z self = 2025-05-07T20:33:28.7108194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7109073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b92bcb1a0>} 2025-05-07T20:33:28.7109924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7110107Z context = 2025-05-07T20:33:28.7110112Z 2025-05-07T20:33:28.7110272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7110527Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7110633Z module_map=module_map) 2025-05-07T20:33:28.7110800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7110887Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7110959Z E ^ 2025-05-07T20:33:28.7111306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7111312Z 2025-05-07T20:33:28.7111718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7111723Z 2025-05-07T20:33:28.7111819Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7112034Z self=, 2025-05-07T20:33:28.7112101Z T=128, 2025-05-07T20:33:28.7112169Z D=7168, 2025-05-07T20:33:28.7112239Z scale_ub=None, 2025-05-07T20:33:28.7112318Z contiguous=True, 2025-05-07T20:33:28.7112391Z compiled=False, 2025-05-07T20:33:28.7112455Z ) 2025-05-07T20:33:28.7112734Z self = 2025-05-07T20:33:28.7112905Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7112910Z 2025-05-07T20:33:28.7112973Z @given( 2025-05-07T20:33:28.7113088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7113177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7113292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7113403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7113507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7113568Z ) 2025-05-07T20:33:28.7113807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7113889Z def test_silu_mul_quant( 2025-05-07T20:33:28.7114015Z self, 2025-05-07T20:33:28.7114083Z T: int, 2025-05-07T20:33:28.7114149Z D: int, 2025-05-07T20:33:28.7114239Z scale_ub: Optional[float], 2025-05-07T20:33:28.7114319Z contiguous: bool, 2025-05-07T20:33:28.7114449Z compiled: bool, 2025-05-07T20:33:28.7114518Z ) -> None: 2025-05-07T20:33:28.7114602Z torch.manual_seed(2025) 2025-05-07T20:33:28.7114669Z 2025-05-07T20:33:28.7114831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7114893Z 2025-05-07T20:33:28.7114976Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7115093Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7115177Z x = x_sign * x_clamp 2025-05-07T20:33:28.7115246Z x0 = x[:, :D] 2025-05-07T20:33:28.7115347Z x1 = x[:, D:] 2025-05-07T20:33:28.7115438Z 2025-05-07T20:33:28.7115548Z if contiguous: 2025-05-07T20:33:28.7115664Z x0 = x0.contiguous() 2025-05-07T20:33:28.7115748Z x1 = x1.contiguous() 2025-05-07T20:33:28.7115813Z 2025-05-07T20:33:28.7115897Z if scale_ub is not None: 2025-05-07T20:33:28.7115992Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7116125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7116189Z ) 2025-05-07T20:33:28.7116255Z else: 2025-05-07T20:33:28.7116395Z scale_ub_tensor = None 2025-05-07T20:33:28.7116458Z 2025-05-07T20:33:28.7116578Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7116661Z op = silu_mul_quant 2025-05-07T20:33:28.7116734Z if compiled: 2025-05-07T20:33:28.7116824Z op = torch.compile(op) 2025-05-07T20:33:28.7116922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7116984Z 2025-05-07T20:33:28.7117067Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7117072Z 2025-05-07T20:33:28.7117157Z moe/activation_test.py:117: 2025-05-07T20:33:28.7117283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7117379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7117472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7117965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7118056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7118408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7118628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7118963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7119045Z kernel = self.compile( 2025-05-07T20:33:28.7119422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7119592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7119765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7119774Z 2025-05-07T20:33:28.7119970Z self = 2025-05-07T20:33:28.7120743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7121248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b923fc040>} 2025-05-07T20:33:28.7121988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7122219Z context = 2025-05-07T20:33:28.7122224Z 2025-05-07T20:33:28.7122418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7122677Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7122782Z module_map=module_map) 2025-05-07T20:33:28.7122934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7123022Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7123089Z E ^ 2025-05-07T20:33:28.7123434Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7123438Z 2025-05-07T20:33:28.7123846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7123853Z 2025-05-07T20:33:28.7123945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7124160Z self=, 2025-05-07T20:33:28.7124332Z T=2048, 2025-05-07T20:33:28.7124406Z D=7168, 2025-05-07T20:33:28.7124482Z scale_ub=1200.0, 2025-05-07T20:33:28.7124556Z contiguous=True, 2025-05-07T20:33:28.7124677Z compiled=False, 2025-05-07T20:33:28.7124740Z ) 2025-05-07T20:33:28.7124950Z self = 2025-05-07T20:33:28.7125114Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7125118Z 2025-05-07T20:33:28.7125184Z @given( 2025-05-07T20:33:28.7125290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7125377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7125484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7125591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7125701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7125765Z ) 2025-05-07T20:33:28.7126005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7126120Z def test_silu_mul_quant( 2025-05-07T20:33:28.7126216Z self, 2025-05-07T20:33:28.7126314Z T: int, 2025-05-07T20:33:28.7126423Z D: int, 2025-05-07T20:33:28.7126552Z scale_ub: Optional[float], 2025-05-07T20:33:28.7126669Z contiguous: bool, 2025-05-07T20:33:28.7126779Z compiled: bool, 2025-05-07T20:33:28.7126881Z ) -> None: 2025-05-07T20:33:28.7126995Z torch.manual_seed(2025) 2025-05-07T20:33:28.7127061Z 2025-05-07T20:33:28.7127226Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7129074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7129086Z 2025-05-07T20:33:28.7129195Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7129199Z 2025-05-07T20:33:28.7129295Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7129509Z self=, 2025-05-07T20:33:28.7129573Z T=1, 2025-05-07T20:33:28.7129641Z D=5120, 2025-05-07T20:33:28.7129711Z scale_ub=1200.0, 2025-05-07T20:33:28.7129785Z contiguous=True, 2025-05-07T20:33:28.7129860Z compiled=False, 2025-05-07T20:33:28.7129967Z ) 2025-05-07T20:33:28.7130175Z self = 2025-05-07T20:33:28.7130338Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7130342Z 2025-05-07T20:33:28.7130450Z @given( 2025-05-07T20:33:28.7130559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7130646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7130754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7130862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7130963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7131024Z ) 2025-05-07T20:33:28.7131269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7131351Z def test_silu_mul_quant( 2025-05-07T20:33:28.7131416Z self, 2025-05-07T20:33:28.7131483Z T: int, 2025-05-07T20:33:28.7131548Z D: int, 2025-05-07T20:33:28.7131640Z scale_ub: Optional[float], 2025-05-07T20:33:28.7131718Z contiguous: bool, 2025-05-07T20:33:28.7131795Z compiled: bool, 2025-05-07T20:33:28.7131863Z ) -> None: 2025-05-07T20:33:28.7131949Z torch.manual_seed(2025) 2025-05-07T20:33:28.7132009Z 2025-05-07T20:33:28.7132169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7132302Z 2025-05-07T20:33:28.7132383Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7132507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7132588Z x = x_sign * x_clamp 2025-05-07T20:33:28.7132663Z x0 = x[:, :D] 2025-05-07T20:33:28.7132739Z x1 = x[:, D:] 2025-05-07T20:33:28.7132806Z 2025-05-07T20:33:28.7132882Z if contiguous: 2025-05-07T20:33:28.7132970Z x0 = x0.contiguous() 2025-05-07T20:33:28.7133052Z x1 = x1.contiguous() 2025-05-07T20:33:28.7133123Z 2025-05-07T20:33:28.7133209Z if scale_ub is not None: 2025-05-07T20:33:28.7133312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7133448Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7133518Z ) 2025-05-07T20:33:28.7133593Z else: 2025-05-07T20:33:28.7133685Z scale_ub_tensor = None 2025-05-07T20:33:28.7133750Z 2025-05-07T20:33:28.7133871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7133959Z op = silu_mul_quant 2025-05-07T20:33:28.7134037Z if compiled: 2025-05-07T20:33:28.7134130Z op = torch.compile(op) 2025-05-07T20:33:28.7134231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7134297Z 2025-05-07T20:33:28.7134387Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7134391Z 2025-05-07T20:33:28.7134481Z moe/activation_test.py:117: 2025-05-07T20:33:28.7134603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7134705Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7134800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7135338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7135439Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7135841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7136064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7136397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7136484Z kernel = self.compile( 2025-05-07T20:33:28.7136863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7137031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7137195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7137206Z 2025-05-07T20:33:28.7137442Z self = 2025-05-07T20:33:28.7138215Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7138717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b923fd580>} 2025-05-07T20:33:28.7139453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7139643Z context = 2025-05-07T20:33:28.7139650Z 2025-05-07T20:33:28.7139810Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7140071Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7140176Z module_map=module_map) 2025-05-07T20:33:28.7140330Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7140463Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7140532Z E ^ 2025-05-07T20:33:28.7140878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7140883Z 2025-05-07T20:33:28.7141290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7141295Z 2025-05-07T20:33:28.7141390Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7141607Z self=, 2025-05-07T20:33:28.7141680Z T=2048, 2025-05-07T20:33:28.7141752Z D=5120, 2025-05-07T20:33:28.7141833Z scale_ub=None, 2025-05-07T20:33:28.7141917Z contiguous=True, 2025-05-07T20:33:28.7141995Z compiled=False, 2025-05-07T20:33:28.7142066Z ) 2025-05-07T20:33:28.7142277Z self = 2025-05-07T20:33:28.7142445Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7142450Z 2025-05-07T20:33:28.7142522Z @given( 2025-05-07T20:33:28.7142632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7142723Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7142833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7142945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7143055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7143127Z ) 2025-05-07T20:33:28.7143368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7143502Z def test_silu_mul_quant( 2025-05-07T20:33:28.7143573Z self, 2025-05-07T20:33:28.7143644Z T: int, 2025-05-07T20:33:28.7143716Z D: int, 2025-05-07T20:33:28.7143810Z scale_ub: Optional[float], 2025-05-07T20:33:28.7143897Z contiguous: bool, 2025-05-07T20:33:28.7143978Z compiled: bool, 2025-05-07T20:33:28.7144051Z ) -> None: 2025-05-07T20:33:28.7144139Z torch.manual_seed(2025) 2025-05-07T20:33:28.7144210Z 2025-05-07T20:33:28.7144371Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7144443Z 2025-05-07T20:33:28.7144533Z > x_sign = torch.sign(x) 2025-05-07T20:33:28.7146641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7146687Z 2025-05-07T20:33:28.7146802Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:28.7146806Z 2025-05-07T20:33:28.7146901Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7147123Z self=, 2025-05-07T20:33:28.7147198Z T=16384, 2025-05-07T20:33:28.7147271Z D=5120, 2025-05-07T20:33:28.7147352Z scale_ub=None, 2025-05-07T20:33:28.7147432Z contiguous=True, 2025-05-07T20:33:28.7147511Z compiled=False, 2025-05-07T20:33:28.7147582Z ) 2025-05-07T20:33:28.7147792Z self = 2025-05-07T20:33:28.7147976Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7147980Z 2025-05-07T20:33:28.7148049Z @given( 2025-05-07T20:33:28.7148165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7148259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7148410Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7148521Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7148631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7148701Z ) 2025-05-07T20:33:28.7148942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7149031Z def test_silu_mul_quant( 2025-05-07T20:33:28.7149103Z self, 2025-05-07T20:33:28.7149178Z T: int, 2025-05-07T20:33:28.7149248Z D: int, 2025-05-07T20:33:28.7149340Z scale_ub: Optional[float], 2025-05-07T20:33:28.7149431Z contiguous: bool, 2025-05-07T20:33:28.7149513Z compiled: bool, 2025-05-07T20:33:28.7149587Z ) -> None: 2025-05-07T20:33:28.7149677Z torch.manual_seed(2025) 2025-05-07T20:33:28.7149759Z 2025-05-07T20:33:28.7149944Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7151741Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
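[NOTE] The OOM request sizes match the test tensors exactly: x has shape [T, 2*D] in bfloat16 (2 bytes per element), and torch.sign(x) materializes a second tensor of the same shape, so the failing allocation is T * 2D * 2 bytes. For T=2048, D=5120 that is 2048 * 10240 * 2 = 41,943,040 bytes = 40.00 MiB, the exact request above; the 112, 320, and 448 MiB requests below follow the same formula for the other (T, D) pairs. The blocker is therefore not any single allocation but the ~21.7 GiB already held from earlier examples. A worked check of the arithmetic (illustrative only):

    def bf16_bytes(T: int, D: int) -> int:
        # Size of one [T, 2*D] bfloat16 tensor, as allocated by the test.
        return T * (2 * D) * 2

    assert bf16_bytes(2048, 5120) == 40 * 1024**2    # 40.00 MiB
    assert bf16_bytes(4096, 7168) == 112 * 1024**2   # 112.00 MiB
    assert bf16_bytes(16384, 5120) == 320 * 1024**2  # 320.00 MiB
    assert bf16_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
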
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7151749Z 2025-05-07T20:33:28.7151863Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7151867Z 2025-05-07T20:33:28.7152013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7152232Z self=, 2025-05-07T20:33:28.7152309Z T=4096, 2025-05-07T20:33:28.7152382Z D=5120, 2025-05-07T20:33:28.7152457Z scale_ub=None, 2025-05-07T20:33:28.7152537Z contiguous=True, 2025-05-07T20:33:28.7152620Z compiled=False, 2025-05-07T20:33:28.7152684Z ) 2025-05-07T20:33:28.7152896Z self = 2025-05-07T20:33:28.7153061Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7153065Z 2025-05-07T20:33:28.7153137Z @given( 2025-05-07T20:33:28.7153251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7153343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7153453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7153604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7153713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7153786Z ) 2025-05-07T20:33:28.7154061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7154148Z def test_silu_mul_quant( 2025-05-07T20:33:28.7154220Z self, 2025-05-07T20:33:28.7154293Z T: int, 2025-05-07T20:33:28.7154363Z D: int, 2025-05-07T20:33:28.7154457Z scale_ub: Optional[float], 2025-05-07T20:33:28.7154538Z contiguous: bool, 2025-05-07T20:33:28.7154616Z compiled: bool, 2025-05-07T20:33:28.7154692Z ) -> None: 2025-05-07T20:33:28.7154778Z torch.manual_seed(2025) 2025-05-07T20:33:28.7154848Z 2025-05-07T20:33:28.7155008Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7156780Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7156836Z 2025-05-07T20:33:28.7156947Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7156951Z 2025-05-07T20:33:28.7157047Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7157264Z self=, 2025-05-07T20:33:28.7157339Z T=2048, 2025-05-07T20:33:28.7157410Z D=5120, 2025-05-07T20:33:28.7157488Z scale_ub=None, 2025-05-07T20:33:28.7157569Z contiguous=False, 2025-05-07T20:33:28.7157648Z compiled=False, 2025-05-07T20:33:28.7157720Z ) 2025-05-07T20:33:28.7157934Z self = 2025-05-07T20:33:28.7158109Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7158114Z 2025-05-07T20:33:28.7158183Z @given( 2025-05-07T20:33:28.7158293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7158393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7158501Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7158609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7158719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7158788Z ) 2025-05-07T20:33:28.7159026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7159116Z def test_silu_mul_quant( 2025-05-07T20:33:28.7159185Z self, 2025-05-07T20:33:28.7159260Z T: int, 2025-05-07T20:33:28.7159331Z D: int, 2025-05-07T20:33:28.7159422Z scale_ub: Optional[float], 2025-05-07T20:33:28.7159556Z contiguous: bool, 2025-05-07T20:33:28.7159636Z compiled: bool, 2025-05-07T20:33:28.7159713Z ) -> None: 2025-05-07T20:33:28.7159815Z torch.manual_seed(2025) 2025-05-07T20:33:28.7159887Z 2025-05-07T20:33:28.7160069Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7162343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7162388Z 2025-05-07T20:33:28.7162510Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7162517Z 2025-05-07T20:33:28.7162623Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7162940Z self=, 2025-05-07T20:33:28.7163020Z T=4096, 2025-05-07T20:33:28.7163091Z D=7168, 2025-05-07T20:33:28.7163170Z scale_ub=None, 2025-05-07T20:33:28.7163258Z contiguous=True, 2025-05-07T20:33:28.7163337Z compiled=True, 2025-05-07T20:33:28.7163404Z ) 2025-05-07T20:33:28.7163619Z self = 2025-05-07T20:33:28.7163780Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7163785Z 2025-05-07T20:33:28.7163853Z @given( 2025-05-07T20:33:28.7163965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7164056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7164168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7164386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7164489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7164558Z ) 2025-05-07T20:33:28.7164796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7164880Z def test_silu_mul_quant( 2025-05-07T20:33:28.7164997Z self, 2025-05-07T20:33:28.7165062Z T: int, 2025-05-07T20:33:28.7165128Z D: int, 2025-05-07T20:33:28.7165224Z scale_ub: Optional[float], 2025-05-07T20:33:28.7165301Z contiguous: bool, 2025-05-07T20:33:28.7165376Z compiled: bool, 2025-05-07T20:33:28.7165444Z ) -> None: 2025-05-07T20:33:28.7165527Z torch.manual_seed(2025) 2025-05-07T20:33:28.7165590Z 2025-05-07T20:33:28.7165748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7167525Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7167540Z 2025-05-07T20:33:28.7167646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7167651Z 2025-05-07T20:33:28.7167745Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7167961Z self=, 2025-05-07T20:33:28.7168027Z T=2048, 2025-05-07T20:33:28.7168091Z D=5120, 2025-05-07T20:33:28.7168164Z scale_ub=1200.0, 2025-05-07T20:33:28.7168243Z contiguous=False, 2025-05-07T20:33:28.7168315Z compiled=False, 2025-05-07T20:33:28.7168379Z ) 2025-05-07T20:33:28.7168633Z self = 2025-05-07T20:33:28.7168803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.7168807Z 2025-05-07T20:33:28.7168870Z @given( 2025-05-07T20:33:28.7168980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7169072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7169175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7169281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7169386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7169448Z ) 2025-05-07T20:33:28.7169683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7169769Z def test_silu_mul_quant( 2025-05-07T20:33:28.7169898Z self, 2025-05-07T20:33:28.7169969Z T: int, 2025-05-07T20:33:28.7170034Z D: int, 2025-05-07T20:33:28.7170124Z scale_ub: Optional[float], 2025-05-07T20:33:28.7170204Z contiguous: bool, 2025-05-07T20:33:28.7170317Z compiled: bool, 2025-05-07T20:33:28.7170383Z ) -> None: 2025-05-07T20:33:28.7170471Z torch.manual_seed(2025) 2025-05-07T20:33:28.7170532Z 2025-05-07T20:33:28.7170693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7172450Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7172457Z 2025-05-07T20:33:28.7172566Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7172570Z 2025-05-07T20:33:28.7172664Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7172875Z self=, 2025-05-07T20:33:28.7172985Z T=4096, 2025-05-07T20:33:28.7175857Z D=7168, 2025-05-07T20:33:28.7175956Z scale_ub=1200.0, 2025-05-07T20:33:28.7176036Z contiguous=True, 2025-05-07T20:33:28.7176116Z compiled=False, 2025-05-07T20:33:28.7176189Z ) 2025-05-07T20:33:28.7176405Z self = 2025-05-07T20:33:28.7176576Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7176581Z 2025-05-07T20:33:28.7176651Z @given( 2025-05-07T20:33:28.7176765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7176871Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7176984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7177096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7177215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7177287Z ) 2025-05-07T20:33:28.7177526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7177630Z def test_silu_mul_quant( 2025-05-07T20:33:28.7177707Z self, 2025-05-07T20:33:28.7177784Z T: int, 2025-05-07T20:33:28.7177857Z D: int, 2025-05-07T20:33:28.7177950Z scale_ub: Optional[float], 2025-05-07T20:33:28.7178038Z contiguous: bool, 2025-05-07T20:33:28.7178120Z compiled: bool, 2025-05-07T20:33:28.7178198Z ) -> None: 2025-05-07T20:33:28.7178293Z torch.manual_seed(2025) 2025-05-07T20:33:28.7178362Z 2025-05-07T20:33:28.7178526Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7180828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7180837Z 2025-05-07T20:33:28.7180953Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7180958Z 2025-05-07T20:33:28.7181058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7181271Z self=, 2025-05-07T20:33:28.7181348Z T=16384, 2025-05-07T20:33:28.7181465Z D=7168, 2025-05-07T20:33:28.7181544Z scale_ub=None, 2025-05-07T20:33:28.7181633Z contiguous=False, 2025-05-07T20:33:28.7181721Z compiled=True, 2025-05-07T20:33:28.7181793Z ) 2025-05-07T20:33:28.7182043Z self = 2025-05-07T20:33:28.7182216Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.7182222Z 2025-05-07T20:33:28.7182292Z @given( 2025-05-07T20:33:28.7182414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7182508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7182623Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7182740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7182847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7182923Z ) 2025-05-07T20:33:28.7183161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7183251Z def test_silu_mul_quant( 2025-05-07T20:33:28.7183324Z self, 2025-05-07T20:33:28.7183406Z T: int, 2025-05-07T20:33:28.7183479Z D: int, 2025-05-07T20:33:28.7183578Z scale_ub: Optional[float], 2025-05-07T20:33:28.7183661Z contiguous: bool, 2025-05-07T20:33:28.7183739Z compiled: bool, 2025-05-07T20:33:28.7183812Z ) -> None: 2025-05-07T20:33:28.7183946Z torch.manual_seed(2025) 2025-05-07T20:33:28.7184018Z 2025-05-07T20:33:28.7184178Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7185945Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7185956Z 2025-05-07T20:33:28.7186071Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7186075Z 2025-05-07T20:33:28.7186172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7186389Z self=, 2025-05-07T20:33:28.7186465Z T=4096, 2025-05-07T20:33:28.7186534Z D=7168, 2025-05-07T20:33:28.7186614Z scale_ub=None, 2025-05-07T20:33:28.7186695Z contiguous=True, 2025-05-07T20:33:28.7186776Z compiled=False, 2025-05-07T20:33:28.7186854Z ) 2025-05-07T20:33:28.7187065Z self = 2025-05-07T20:33:28.7187232Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7187236Z 2025-05-07T20:33:28.7187306Z @given( 2025-05-07T20:33:28.7187419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7187555Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7187665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7187776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7187887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7187979Z ) 2025-05-07T20:33:28.7188298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7188390Z def test_silu_mul_quant( 2025-05-07T20:33:28.7188461Z self, 2025-05-07T20:33:28.7188537Z T: int, 2025-05-07T20:33:28.7188606Z D: int, 2025-05-07T20:33:28.7188699Z scale_ub: Optional[float], 2025-05-07T20:33:28.7188786Z contiguous: bool, 2025-05-07T20:33:28.7188864Z compiled: bool, 2025-05-07T20:33:28.7188935Z ) -> None: 2025-05-07T20:33:28.7189026Z torch.manual_seed(2025) 2025-05-07T20:33:28.7189148Z 2025-05-07T20:33:28.7189309Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7191192Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7191203Z 2025-05-07T20:33:28.7191330Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7191335Z 2025-05-07T20:33:28.7191455Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7191670Z self=, 2025-05-07T20:33:28.7191745Z T=16384, 2025-05-07T20:33:28.7191815Z D=7168, 2025-05-07T20:33:28.7191895Z scale_ub=None, 2025-05-07T20:33:28.7191982Z contiguous=True, 2025-05-07T20:33:28.7192066Z compiled=False, 2025-05-07T20:33:28.7192133Z ) 2025-05-07T20:33:28.7192348Z self = 2025-05-07T20:33:28.7192515Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.7192563Z 2025-05-07T20:33:28.7192636Z @given( 2025-05-07T20:33:28.7192751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7192845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7192957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7193066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7193170Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7193243Z ) 2025-05-07T20:33:28.7193482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7193573Z def test_silu_mul_quant( 2025-05-07T20:33:28.7193650Z self, 2025-05-07T20:33:28.7193720Z T: int, 2025-05-07T20:33:28.7193795Z D: int, 2025-05-07T20:33:28.7193894Z scale_ub: Optional[float], 2025-05-07T20:33:28.7193978Z contiguous: bool, 2025-05-07T20:33:28.7194056Z compiled: bool, 2025-05-07T20:33:28.7194132Z ) -> None: 2025-05-07T20:33:28.7194218Z torch.manual_seed(2025) 2025-05-07T20:33:28.7194291Z 2025-05-07T20:33:28.7194452Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7196260Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7196273Z 2025-05-07T20:33:28.7196384Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7196389Z 2025-05-07T20:33:28.7196489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7196710Z self=, 2025-05-07T20:33:28.7196780Z T=16384, 2025-05-07T20:33:28.7196856Z D=7168, 2025-05-07T20:33:28.7196937Z scale_ub=1200.0, 2025-05-07T20:33:28.7197015Z contiguous=True, 2025-05-07T20:33:28.7197093Z compiled=False, 2025-05-07T20:33:28.7197167Z ) 2025-05-07T20:33:28.7197376Z self = 2025-05-07T20:33:28.7197544Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7197615Z 2025-05-07T20:33:28.7197689Z @given( 2025-05-07T20:33:28.7197802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7197893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7198043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7198156Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7198269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7198342Z ) 2025-05-07T20:33:28.7198584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7198673Z def test_silu_mul_quant( 2025-05-07T20:33:28.7198745Z self, 2025-05-07T20:33:28.7198817Z T: int, 2025-05-07T20:33:28.7198891Z D: int, 2025-05-07T20:33:28.7198984Z scale_ub: Optional[float], 2025-05-07T20:33:28.7199068Z contiguous: bool, 2025-05-07T20:33:28.7199152Z compiled: bool, 2025-05-07T20:33:28.7199232Z ) -> None: 2025-05-07T20:33:28.7199324Z torch.manual_seed(2025) 2025-05-07T20:33:28.7199396Z 2025-05-07T20:33:28.7199559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7201322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
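[NOTE] Hypothesis keeps drawing new examples after each failure, and every draw inherits a nearly full device (26.44 MiB free of 22.07 GiB), so each one dies at its first allocation. Two mitigations, sketched here as assumptions rather than anything this suite actually does: free cached blocks between examples, and opt into expandable segments as the error text itself suggests (the env var must be set before CUDA is first initialized in the process):

    import gc
    import os
    import unittest
    import torch

    # Must be set before the first CUDA allocation to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    class ActivationTests(unittest.TestCase):
        def tearDown(self) -> None:
            gc.collect()              # drop Python references to dead tensors
            torch.cuda.empty_cache()  # return cached blocks to the allocator
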
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7201370Z 2025-05-07T20:33:28.7201482Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7201486Z 2025-05-07T20:33:28.7201584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7201800Z self=, 2025-05-07T20:33:28.7201875Z T=128, 2025-05-07T20:33:28.7201950Z D=5120, 2025-05-07T20:33:28.7202027Z scale_ub=1200.0, 2025-05-07T20:33:28.7202110Z contiguous=False, 2025-05-07T20:33:28.7202192Z compiled=False, 2025-05-07T20:33:28.7202260Z ) 2025-05-07T20:33:28.7202469Z self = 2025-05-07T20:33:28.7202639Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.7202644Z 2025-05-07T20:33:28.7202718Z @given( 2025-05-07T20:33:28.7202834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7202927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7203034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7203147Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7203252Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7203324Z ) 2025-05-07T20:33:28.7203605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7203696Z def test_silu_mul_quant( 2025-05-07T20:33:28.7203768Z self, 2025-05-07T20:33:28.7203842Z T: int, 2025-05-07T20:33:28.7203910Z D: int, 2025-05-07T20:33:28.7204002Z scale_ub: Optional[float], 2025-05-07T20:33:28.7204089Z contiguous: bool, 2025-05-07T20:33:28.7204167Z compiled: bool, 2025-05-07T20:33:28.7204355Z ) -> None: 2025-05-07T20:33:28.7204446Z torch.manual_seed(2025) 2025-05-07T20:33:28.7204507Z 2025-05-07T20:33:28.7204670Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7204732Z 2025-05-07T20:33:28.7204813Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7204934Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7205012Z x = x_sign * x_clamp 2025-05-07T20:33:28.7205132Z x0 = x[:, :D] 2025-05-07T20:33:28.7205203Z x1 = x[:, D:] 2025-05-07T20:33:28.7205266Z 2025-05-07T20:33:28.7205344Z if contiguous: 2025-05-07T20:33:28.7205427Z x0 = x0.contiguous() 2025-05-07T20:33:28.7205549Z x1 = x1.contiguous() 2025-05-07T20:33:28.7205614Z 2025-05-07T20:33:28.7205693Z if scale_ub is not None: 2025-05-07T20:33:28.7205788Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7205925Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7205988Z ) 2025-05-07T20:33:28.7206056Z else: 2025-05-07T20:33:28.7206145Z scale_ub_tensor = None 2025-05-07T20:33:28.7206205Z 2025-05-07T20:33:28.7206326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7206407Z op = silu_mul_quant 2025-05-07T20:33:28.7206480Z if compiled: 2025-05-07T20:33:28.7206569Z op = torch.compile(op) 2025-05-07T20:33:28.7206670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7206730Z 2025-05-07T20:33:28.7206823Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7206828Z 2025-05-07T20:33:28.7206914Z moe/activation_test.py:117: 2025-05-07T20:33:28.7207038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7207135Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7207272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7207773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7207863Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7208526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7208760Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7209092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7209180Z kernel = self.compile( 2025-05-07T20:33:28.7209558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7209729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7209852Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7209863Z 2025-05-07T20:33:28.7210059Z self = 2025-05-07T20:33:28.7210830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7211330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b922251c0>} 2025-05-07T20:33:28.7212172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7212363Z context = 2025-05-07T20:33:28.7212371Z 2025-05-07T20:33:28.7212527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7212781Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7212882Z module_map=module_map) 2025-05-07T20:33:28.7213035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7213122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7213190Z E ^ 2025-05-07T20:33:28.7213536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7213600Z 2025-05-07T20:33:28.7214013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7214082Z 2025-05-07T20:33:28.7214175Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7214391Z self=, 2025-05-07T20:33:28.7214465Z T=2048, 2025-05-07T20:33:28.7214535Z D=7168, 2025-05-07T20:33:28.7214618Z scale_ub=None, 2025-05-07T20:33:28.7214701Z contiguous=False, 2025-05-07T20:33:28.7214779Z compiled=False, 2025-05-07T20:33:28.7214853Z ) 2025-05-07T20:33:28.7215063Z self = 2025-05-07T20:33:28.7215229Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7215234Z 2025-05-07T20:33:28.7215306Z @given( 2025-05-07T20:33:28.7215418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7215513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7215628Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7215742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7215850Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7215919Z ) 2025-05-07T20:33:28.7216220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7216312Z def test_silu_mul_quant( 2025-05-07T20:33:28.7216384Z self, 2025-05-07T20:33:28.7216458Z T: int, 2025-05-07T20:33:28.7216533Z D: int, 2025-05-07T20:33:28.7216626Z scale_ub: Optional[float], 2025-05-07T20:33:28.7216710Z contiguous: bool, 2025-05-07T20:33:28.7216792Z compiled: bool, 2025-05-07T20:33:28.7216865Z ) -> None: 2025-05-07T20:33:28.7216952Z torch.manual_seed(2025) 2025-05-07T20:33:28.7217024Z 2025-05-07T20:33:28.7217191Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7218970Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7218979Z 2025-05-07T20:33:28.7219092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7219096Z 2025-05-07T20:33:28.7219196Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7219411Z self=, 2025-05-07T20:33:28.7219485Z T=128, 2025-05-07T20:33:28.7219557Z D=7168, 2025-05-07T20:33:28.7219633Z scale_ub=1200.0, 2025-05-07T20:33:28.7219757Z contiguous=True, 2025-05-07T20:33:28.7219841Z compiled=True, 2025-05-07T20:33:28.7219910Z ) 2025-05-07T20:33:28.7220123Z self = 2025-05-07T20:33:28.7220287Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.7220294Z 2025-05-07T20:33:28.7220363Z @given( 2025-05-07T20:33:28.7220479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7220571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7220678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7220794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7220900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7220971Z ) 2025-05-07T20:33:28.7221215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7221344Z def test_silu_mul_quant( 2025-05-07T20:33:28.7221415Z self, 2025-05-07T20:33:28.7221491Z T: int, 2025-05-07T20:33:28.7221561Z D: int, 2025-05-07T20:33:28.7221699Z scale_ub: Optional[float], 2025-05-07T20:33:28.7221783Z contiguous: bool, 2025-05-07T20:33:28.7221861Z compiled: bool, 2025-05-07T20:33:28.7221940Z ) -> None: 2025-05-07T20:33:28.7222027Z torch.manual_seed(2025) 2025-05-07T20:33:28.7222100Z 2025-05-07T20:33:28.7222265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7222335Z 2025-05-07T20:33:28.7222421Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7222543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7222627Z x = x_sign * x_clamp 2025-05-07T20:33:28.7222702Z x0 = x[:, :D] 2025-05-07T20:33:28.7222779Z x1 = x[:, D:] 2025-05-07T20:33:28.7222845Z 2025-05-07T20:33:28.7222927Z if contiguous: 2025-05-07T20:33:28.7223016Z x0 = x0.contiguous() 2025-05-07T20:33:28.7223101Z x1 = x1.contiguous() 2025-05-07T20:33:28.7223171Z 2025-05-07T20:33:28.7223260Z if scale_ub is not None: 2025-05-07T20:33:28.7223359Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7223491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7223607Z ) 2025-05-07T20:33:28.7223677Z else: 2025-05-07T20:33:28.7223766Z scale_ub_tensor = None 2025-05-07T20:33:28.7223831Z 2025-05-07T20:33:28.7223954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7224042Z op = silu_mul_quant 2025-05-07T20:33:28.7224121Z if compiled: 2025-05-07T20:33:28.7224214Z op = torch.compile(op) 2025-05-07T20:33:28.7224317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7224385Z 2025-05-07T20:33:28.7224475Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7224481Z 2025-05-07T20:33:28.7224574Z moe/activation_test.py:117: 2025-05-07T20:33:28.7224703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7224809Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7224905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7225266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7225357Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7225885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7225987Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7226335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7226550Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7226954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7227045Z kernel = self.compile( 2025-05-07T20:33:28.7227425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7227601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7227724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7227728Z 2025-05-07T20:33:28.7227930Z self = 2025-05-07T20:33:28.7228699Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7229203Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8b920bfb00>} 2025-05-07T20:33:28.7230210Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7230404Z context = 2025-05-07T20:33:28.7230412Z 2025-05-07T20:33:28.7230576Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7230834Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7230940Z module_map=module_map) 2025-05-07T20:33:28.7231095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7231188Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7231268Z E ^ 2025-05-07T20:33:28.7231621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7231629Z 2025-05-07T20:33:28.7232038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7232047Z 2025-05-07T20:33:28.7232145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7232406Z self=, 2025-05-07T20:33:28.7232479Z T=128, 2025-05-07T20:33:28.7232551Z D=7168, 2025-05-07T20:33:28.7232629Z scale_ub=1200.0, 2025-05-07T20:33:28.7232710Z contiguous=True, 2025-05-07T20:33:28.7232791Z compiled=False, 2025-05-07T20:33:28.7232860Z ) 2025-05-07T20:33:28.7233074Z self = 2025-05-07T20:33:28.7233241Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.7233245Z 2025-05-07T20:33:28.7233318Z @given( 2025-05-07T20:33:28.7233439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7233533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7233648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7233760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7233868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7233940Z ) 2025-05-07T20:33:28.7234178Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7234268Z def test_silu_mul_quant( 2025-05-07T20:33:28.7234339Z self, 2025-05-07T20:33:28.7234411Z T: int, 2025-05-07T20:33:28.7234483Z D: int, 2025-05-07T20:33:28.7234579Z scale_ub: Optional[float], 2025-05-07T20:33:28.7234662Z contiguous: bool, 2025-05-07T20:33:28.7234749Z compiled: bool, 2025-05-07T20:33:28.7234821Z ) -> None: 2025-05-07T20:33:28.7234909Z torch.manual_seed(2025) 2025-05-07T20:33:28.7234985Z 2025-05-07T20:33:28.7235195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7235262Z 2025-05-07T20:33:28.7235352Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7235476Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7237239Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
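[NOTE] The compiled=True examples fail identically: torch._dynamo's eval_frame wrapper calls back into silu_mul_quant, which launches the same _fbgemm_silu_mul_quant Triton kernel, so torch.compile does not relax the FP8 architecture requirement. Any gate therefore has to run before the op is wrapped; a sketch reusing the hypothetical fp8_e4m3_supported helper from the note above:

    # Inside the test body, before wrapping the op (sketch only):
    if not fp8_e4m3_supported():
        self.skipTest("FP8 e4m3 needs SM >= 8.9; this GPU does not have it")
    op = torch.compile(silu_mul_quant) if compiled else silu_mul_quant
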
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7237248Z 2025-05-07T20:33:28.7237364Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.7237410Z 2025-05-07T20:33:28.7237507Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7237728Z self=, 2025-05-07T20:33:28.7237839Z T=128, 2025-05-07T20:33:28.7237907Z D=5120, 2025-05-07T20:33:28.7237988Z scale_ub=1200.0, 2025-05-07T20:33:28.7238067Z contiguous=True, 2025-05-07T20:33:28.7238146Z compiled=True, 2025-05-07T20:33:28.7238216Z ) 2025-05-07T20:33:28.7238426Z self = 2025-05-07T20:33:28.7238588Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:28.7238592Z 2025-05-07T20:33:28.7238667Z @given( 2025-05-07T20:33:28.7238776Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7238874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7238983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7239098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7239211Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7239283Z ) 2025-05-07T20:33:28.7239522Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7239614Z def test_silu_mul_quant( 2025-05-07T20:33:28.7239690Z self, 2025-05-07T20:33:28.7239761Z T: int, 2025-05-07T20:33:28.7239885Z D: int, 2025-05-07T20:33:28.7239978Z scale_ub: Optional[float], 2025-05-07T20:33:28.7240061Z contiguous: bool, 2025-05-07T20:33:28.7240141Z compiled: bool, 2025-05-07T20:33:28.7240212Z ) -> None: 2025-05-07T20:33:28.7240304Z torch.manual_seed(2025) 2025-05-07T20:33:28.7240376Z 2025-05-07T20:33:28.7240540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7240612Z 2025-05-07T20:33:28.7240698Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7240818Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7242580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7242587Z 2025-05-07T20:33:28.7242699Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:28.7242704Z 2025-05-07T20:33:28.7242801Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7243016Z self=, 2025-05-07T20:33:28.7243088Z T=128, 2025-05-07T20:33:28.7243154Z D=7168, 2025-05-07T20:33:28.7243237Z scale_ub=None, 2025-05-07T20:33:28.7243329Z contiguous=True, 2025-05-07T20:33:28.7243453Z compiled=True, 2025-05-07T20:33:28.7243519Z ) 2025-05-07T20:33:28.7243734Z self = 2025-05-07T20:33:28.7243892Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.7243896Z 2025-05-07T20:33:28.7243974Z @given( 2025-05-07T20:33:28.7244093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7244183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7244427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7244544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7244651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7244725Z ) 2025-05-07T20:33:28.7244965Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7245098Z def test_silu_mul_quant( 2025-05-07T20:33:28.7245177Z self, 2025-05-07T20:33:28.7245248Z T: int, 2025-05-07T20:33:28.7245324Z D: int, 2025-05-07T20:33:28.7245420Z scale_ub: Optional[float], 2025-05-07T20:33:28.7245544Z contiguous: bool, 2025-05-07T20:33:28.7245623Z compiled: bool, 2025-05-07T20:33:28.7245697Z ) -> None: 2025-05-07T20:33:28.7245783Z torch.manual_seed(2025) 2025-05-07T20:33:28.7245855Z 2025-05-07T20:33:28.7246018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7247771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:28.7247784Z 2025-05-07T20:33:28.7247895Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:28.7248027Z =============================== warnings summary =============================== 2025-05-07T20:33:28.7248331Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7248668Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7248957Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:28.7249827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:28.7250057Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:28.7250061Z 2025-05-07T20:33:28.7250270Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:28.7250431Z ================= 1 failed, 1 deselected, 3 warnings in 12.34s ================= 2025-05-07T20:33:30.5987441Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:30.6712344Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:30.6712752Z 2025-05-07T20:33:30.6713029Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:30.6713890Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:30.6714488Z 2025-05-07T20:33:30.6714523Z 2025-05-07T20:33:30.6714530Z 2025-05-07T20:33:30.6733032Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:30.6816186Z Post job cleanup. 2025-05-07T20:33:30.7821721Z [command]/usr/bin/git version 2025-05-07T20:33:30.7863102Z git version 2.47.1 2025-05-07T20:33:30.7901140Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/914c4422-14f6-428a-b58a-905ac220765a/.gitconfig' 2025-05-07T20:33:30.7912295Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/914c4422-14f6-428a-b58a-905ac220765a' before making global git config changes 2025-05-07T20:33:30.7913641Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:30.7929124Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:30.7977098Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:30.8014114Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:30.8355085Z Entering 'external/asmjit' 2025-05-07T20:33:30.8421773Z Entering 'external/composable_kernel' 2025-05-07T20:33:30.8495493Z Entering 'external/cpuinfo' 2025-05-07T20:33:30.8563563Z Entering 'external/cutlass' 2025-05-07T20:33:30.8639984Z Entering 'external/googletest' 2025-05-07T20:33:30.8712657Z Entering 'external/hipify_torch' 2025-05-07T20:33:30.8779281Z Entering 'external/json' 2025-05-07T20:33:30.8869081Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:30.8895859Z http.https://github.com/.extraheader 2025-05-07T20:33:30.8908973Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:30.8945364Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:30.9281535Z Entering 'external/asmjit' 2025-05-07T20:33:30.9325852Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9369878Z Entering 'external/composable_kernel' 2025-05-07T20:33:30.9413550Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9462963Z Entering 'external/cpuinfo' 2025-05-07T20:33:30.9505558Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9549029Z Entering 'external/cutlass' 2025-05-07T20:33:30.9591708Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9643076Z 
Entering 'external/googletest' 2025-05-07T20:33:30.9685305Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9727946Z Entering 'external/hipify_torch' 2025-05-07T20:33:30.9772050Z http.https://github.com/.extraheader 2025-05-07T20:33:30.9814941Z Entering 'external/json' 2025-05-07T20:33:30.9857388Z http.https://github.com/.extraheader 2025-05-07T20:33:31.0014315Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:31.0047004Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:31.0057326Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:31.0057678Z ##[endgroup] 2025-05-07T20:33:31.0160591Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:42.2000046Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:59.1149302Z Cleaning up orphan processes