2025-05-07T20:23:13.3036442Z Current runner version: '2.323.0'
2025-05-07T20:23:13.3042098Z Runner name: 'i-0e56304501e4f5200'
2025-05-07T20:23:13.3043019Z Machine name: 'ip-10-0-66-0'
2025-05-07T20:23:13.3045748Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:13.3048258Z Contents: read
2025-05-07T20:23:13.3048785Z Metadata: read
2025-05-07T20:23:13.3049283Z Packages: read
2025-05-07T20:23:13.3049776Z ##[endgroup]
2025-05-07T20:23:13.3052023Z Secret source: None
2025-05-07T20:23:13.3052672Z Prepare workflow directory
2025-05-07T20:23:13.3998352Z Prepare all required actions
2025-05-07T20:23:13.4040080Z Getting action download info
2025-05-07T20:23:13.5908992Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:13.8870304Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:14.3031624Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:16.0510324Z Getting action download info
2025-05-07T20:23:16.1757315Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:16.3754906Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.11, 12.8.0, 12.6.3, clang)
2025-05-07T20:23:16.4268534Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:16.4378251Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:16.4389804Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.4390444Z ##[endgroup]
2025-05-07T20:23:17.4543401Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:17.4543849Z Instance Type: g5.4xlarge
2025-05-07T20:23:17.4544098Z AMI Name: unknown
2025-05-07T20:23:17.4584494Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:22.8518324Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:22.8518627Z with:
2025-05-07T20:23:22.8518877Z   submodules: true
2025-05-07T20:23:22.8519110Z   repository: pytorch/FBGEMM
2025-05-07T20:23:22.8519490Z   token: ***
2025-05-07T20:23:22.8519691Z   ssh-strict: true
2025-05-07T20:23:22.8519907Z   ssh-user: git
2025-05-07T20:23:22.8520125Z   persist-credentials: true
2025-05-07T20:23:22.8520374Z   clean: true
2025-05-07T20:23:22.8520605Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:22.8520874Z   fetch-depth: 1
2025-05-07T20:23:22.8521087Z   fetch-tags: false
2025-05-07T20:23:22.8521310Z   show-progress: true
2025-05-07T20:23:22.8521530Z   lfs: false
2025-05-07T20:23:22.8521734Z   set-safe-directory: true
2025-05-07T20:23:22.8521992Z env:
2025-05-07T20:23:22.8522201Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:22.8522514Z   BUILD_ENV: build_binary
2025-05-07T20:23:22.8522767Z   BUILD_TARGET: genai
2025-05-07T20:23:22.8523000Z   BUILD_VARIANT: cuda
2025-05-07T20:23:22.8523252Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:22.8523496Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:22.8523729Z ##[endgroup]
2025-05-07T20:23:22.9665761Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:22.9667292Z ##[group]Getting Git version info
2025-05-07T20:23:22.9667866Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:22.9668672Z [command]/usr/bin/git version
2025-05-07T20:23:22.9668992Z git version 2.47.1
2025-05-07T20:23:22.9684811Z ##[endgroup]
2025-05-07T20:23:22.9695164Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/cd033a63-f207-416f-848b-cd9b9c59e344/.gitconfig'
2025-05-07T20:23:22.9703861Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cd033a63-f207-416f-848b-cd9b9c59e344' before making global git config changes
2025-05-07T20:23:22.9704874Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:22.9718713Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:22.9763958Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:22.9787535Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:22.9805691Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:22.9809310Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:22.9834611Z refs/heads/main
2025-05-07T20:23:22.9843747Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:23.8554151Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.8605850Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:23.8636800Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:23.8642463Z ##[endgroup]
2025-05-07T20:23:23.8645500Z [command]/usr/bin/git submodule status
2025-05-07T20:23:23.9066446Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:23.9150765Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:23.9239389Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:23.9325586Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:23.9416447Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:23.9506617Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:23.9588107Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:23.9601043Z ##[group]Cleaning the repository
2025-05-07T20:23:23.9605735Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:23.9664296Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:23.9775844Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.9783188Z ##[endgroup]
2025-05-07T20:23:23.9785000Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:23.9789380Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:23.9821202Z ##[endgroup]
2025-05-07T20:23:23.9821737Z ##[group]Setting up auth
2025-05-07T20:23:23.9826513Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:23.9870138Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:24.0201139Z Entering 'external/asmjit'
2025-05-07T20:23:24.0267991Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.0340873Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.0407200Z Entering 'external/cutlass'
2025-05-07T20:23:24.0483848Z Entering 'external/googletest'
2025-05-07T20:23:24.0550736Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.0616218Z Entering 'external/json'
2025-05-07T20:23:24.0703323Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:24.0736447Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:24.1070946Z Entering 'external/asmjit'
2025-05-07T20:23:24.1136347Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.1209082Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.1276589Z Entering 'external/cutlass'
2025-05-07T20:23:24.1352165Z Entering 'external/googletest'
2025-05-07T20:23:24.1418041Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.1486717Z Entering 'external/json'
2025-05-07T20:23:24.1572934Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.1624960Z ##[endgroup]
2025-05-07T20:23:24.1625536Z ##[group]Fetching the repository
2025-05-07T20:23:24.1632533Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:24.3561056Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:24.3561934Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:24.3587124Z ##[endgroup]
2025-05-07T20:23:24.3587771Z ##[group]Determining the checkout info
2025-05-07T20:23:24.3588713Z ##[endgroup]
2025-05-07T20:23:24.3593047Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:24.3644138Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:24.3672941Z ##[group]Checking out the ref
2025-05-07T20:23:24.3676278Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:24.3802887Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.3806053Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:24.3815875Z ##[endgroup]
2025-05-07T20:23:24.3816402Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:24.3821394Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.3873051Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:24.3903721Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:24.3936429Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:24.3965855Z ##[endgroup]
2025-05-07T20:23:24.3966727Z ##[group]Fetching submodules
2025-05-07T20:23:24.3970398Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:24.4351870Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:24.4352379Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:24.4353147Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:24.4353691Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:24.4354174Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:24.4354653Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:24.4355174Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:24.4368168Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:24.4799524Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:24.4951469Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:24.5054868Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:24.5222574Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:24.5312676Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:24.5394852Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:24.5495865Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:24.5512713Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:24.5842974Z Entering 'external/asmjit'
2025-05-07T20:23:24.5873797Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.5906733Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.5939033Z Entering 'external/cutlass'
2025-05-07T20:23:24.5971136Z Entering 'external/googletest'
2025-05-07T20:23:24.6002527Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.6035598Z Entering 'external/json'
2025-05-07T20:23:24.6080207Z ##[endgroup]
2025-05-07T20:23:24.6081101Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:24.6087100Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:24.6418255Z Entering 'external/asmjit'
2025-05-07T20:23:24.6459398Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6459811Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6502319Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.6545295Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6545752Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6594520Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.6638551Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6638898Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6681709Z Entering 'external/cutlass'
2025-05-07T20:23:24.6724073Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6724411Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6775543Z Entering 'external/googletest'
2025-05-07T20:23:24.6817771Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6818127Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6861043Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.6903600Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6903992Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6952119Z Entering 'external/json'
2025-05-07T20:23:24.6994565Z url.https://github.com/.insteadof
2025-05-07T20:23:24.6995012Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7056138Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:24.7385152Z Entering 'external/asmjit'
2025-05-07T20:23:24.7451935Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:24.7454449Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.7515591Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:24.7518483Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.7580437Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:24.7583573Z Entering 'external/cutlass'
2025-05-07T20:23:24.7644798Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:24.7648521Z Entering 'external/googletest'
2025-05-07T20:23:24.7710201Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:24.7713319Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7774950Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:24.7777929Z Entering 'external/json'
2025-05-07T20:23:24.7843891Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:24.7967531Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:24.8300924Z Entering 'external/asmjit'
2025-05-07T20:23:24.8333007Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8366209Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8401802Z Entering 'external/cutlass'
2025-05-07T20:23:24.8434698Z Entering 'external/googletest'
2025-05-07T20:23:24.8467303Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.8501473Z Entering 'external/json'
2025-05-07T20:23:24.8548335Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:24.8881129Z Entering 'external/asmjit'
2025-05-07T20:23:24.8913415Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8947390Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8979148Z Entering 'external/cutlass'
2025-05-07T20:23:24.9011521Z Entering 'external/googletest'
2025-05-07T20:23:24.9043702Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.9074991Z Entering 'external/json'
2025-05-07T20:23:24.9121653Z ##[endgroup]
2025-05-07T20:23:24.9163783Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:24.9190384Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:24.9381299Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:24.9381639Z with:
2025-05-07T20:23:24.9381893Z   name: fbgemm_genai_x86_clang_py3.11_cu12.8.0.whl
2025-05-07T20:23:24.9382249Z   merge-multiple: false
2025-05-07T20:23:24.9382522Z   repository: pytorch/FBGEMM
2025-05-07T20:23:24.9382800Z   run-id: 14891846252
2025-05-07T20:23:24.9383023Z env:
2025-05-07T20:23:24.9383253Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:24.9383577Z   BUILD_ENV: build_binary
2025-05-07T20:23:24.9383842Z   BUILD_TARGET: genai
2025-05-07T20:23:24.9384081Z   BUILD_VARIANT: cuda
2025-05-07T20:23:24.9384334Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:24.9384602Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:24.9384857Z ##[endgroup]
2025-05-07T20:23:25.1709420Z Downloading single artifact
2025-05-07T20:23:25.2633067Z Preparing to download the following artifacts:
2025-05-07T20:23:25.2634215Z - fbgemm_genai_x86_clang_py3.11_cu12.8.0.whl (ID: 3081407693, Size: 18493360, Expected Digest: sha256:712e5982f3c27e6bb70c4c07f6076ab85e5daa73adc8fdd928558f49c8845247)
2025-05-07T20:23:25.3174766Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-0c78ae5c-d1af-5cac-9cef-71d15264925f/artifacts/26da78488c24807c90bb678b8b7579283275a81cc21beba82a9498d4848351d8.zip
2025-05-07T20:23:25.3176167Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:25.4401003Z (node:208300) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:25.4402063Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:25.7347652Z SHA256 digest of downloaded artifact is 712e5982f3c27e6bb70c4c07f6076ab85e5daa73adc8fdd928558f49c8845247
2025-05-07T20:23:25.7348419Z Artifact download completed successfully.
2025-05-07T20:23:25.7348753Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:25.7354205Z Download artifact has finished successfully
2025-05-07T20:23:25.7610337Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:25.7610732Z with:
2025-05-07T20:23:25.7610952Z   driver-version: 570.133.07
2025-05-07T20:23:25.7611197Z env:
2025-05-07T20:23:25.7611424Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7611737Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7611986Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7612213Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7612453Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:25.7612718Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7612950Z ##[endgroup]
2025-05-07T20:23:25.7708851Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:25.7709338Z with:
2025-05-07T20:23:25.7709543Z   timeout_minutes: 10
2025-05-07T20:23:25.7709774Z   max_attempts: 3
2025-05-07T20:23:25.7732984Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there is
          # more than one, so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:25.7755818Z   retry_wait_seconds: 10
2025-05-07T20:23:25.7756080Z   polling_interval_seconds: 1
2025-05-07T20:23:25.7756339Z   warning_on_retry: true
2025-05-07T20:23:25.7775723Z   continue_on_error: false
2025-05-07T20:23:25.7776005Z env:
2025-05-07T20:23:25.7776222Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7776559Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7776796Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7777014Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7777254Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:25.7777503Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7777738Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:25.7777990Z ##[endgroup]
2025-05-07T20:23:26.6631196Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:26.6631919Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:26.6634792Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:26.9881261Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:26.9881645Z No packages marked for removal.
2025-05-07T20:23:26.9949066Z Dependencies resolved.
2025-05-07T20:23:26.9959012Z Nothing to do.
2025-05-07T20:23:26.9959667Z Complete!
2025-05-07T20:23:27.0885324Z + install_nvidia_driver_common
2025-05-07T20:23:27.0889519Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:27.0889857Z + lspci
2025-05-07T20:23:27.0890580Z Before installing NVIDIA driver
2025-05-07T20:23:27.1008984Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.1010571Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.1012110Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.1013599Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.1014564Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.1015507Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.1016167Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.1016644Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.1017048Z + lsmod
2025-05-07T20:23:27.1060200Z Module                  Size  Used by
2025-05-07T20:23:27.1060905Z veth                   36864  0
2025-05-07T20:23:27.1061671Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.1062512Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.1063447Z wmi                    36864  1 video
2025-05-07T20:23:27.1064193Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.1064773Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.1065415Z drm                   602112  1 nvidia
2025-05-07T20:23:27.1065981Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.1066330Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.1066677Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.1066968Z xt_conntrack           16384  1
2025-05-07T20:23:27.1067221Z nft_chain_nat          16384  3
2025-05-07T20:23:27.1067480Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.1067777Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.1068654Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.1069093Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.1069525Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.1069833Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.1070114Z xfrm_user              57344  1
2025-05-07T20:23:27.1070377Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.1070662Z xt_addrtype            16384  2
2025-05-07T20:23:27.1070911Z nft_compat             20480  4
2025-05-07T20:23:27.1071214Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.1071617Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.1071975Z br_netfilter           36864  0
2025-05-07T20:23:27.1072258Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.1072553Z stp                    16384  1 bridge
2025-05-07T20:23:27.1072845Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.1073116Z overlay               167936  0
2025-05-07T20:23:27.1073366Z tls                   135168  0
2025-05-07T20:23:27.1073619Z nls_ascii              16384  1
2025-05-07T20:23:27.1073863Z nls_cp437              20480  1
2025-05-07T20:23:27.1074105Z vfat                   24576  1
2025-05-07T20:23:27.1074353Z fat                    86016  1 vfat
2025-05-07T20:23:27.1074612Z sunrpc                696320  1
2025-05-07T20:23:27.1074857Z ena                   180224  0
2025-05-07T20:23:27.1075099Z i8042                  45056  0
2025-05-07T20:23:27.1075343Z serio                  28672  3 i8042
2025-05-07T20:23:27.1075617Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.1075878Z button                 24576  0
2025-05-07T20:23:27.1076130Z sch_fq_codel           20480  17
2025-05-07T20:23:27.1076383Z dm_mod                188416  0
2025-05-07T20:23:27.1076626Z fuse                  163840  1
2025-05-07T20:23:27.1076870Z loop                   36864  0
2025-05-07T20:23:27.1077115Z configfs               57344  1
2025-05-07T20:23:27.1077366Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.1077637Z dmi_sysfs              20480  0
2025-05-07T20:23:27.1077879Z crc32_pclmul           16384  0
2025-05-07T20:23:27.1078274Z crc32c_intel           24576  0
2025-05-07T20:23:27.1078525Z efivarfs               24576  1
2025-05-07T20:23:27.1078776Z + modinfo nvidia
2025-05-07T20:23:27.1079586Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.1080116Z import_ns:      DMA_BUF
2025-05-07T20:23:27.1080370Z alias:          char-major-195-*
2025-05-07T20:23:27.1080644Z version:        570.133.07
2025-05-07T20:23:27.1080894Z supported:      external
2025-05-07T20:23:27.1081141Z license:        Dual MIT/GPL
2025-05-07T20:23:27.1081431Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.1081774Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.1082107Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:27.1082421Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.1082760Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.1083092Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.1083402Z depends:        i2c-core,drm
2025-05-07T20:23:27.1083669Z retpoline:      Y
2025-05-07T20:23:27.1083893Z name:           nvidia
2025-05-07T20:23:27.1084259Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.1084727Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.1085284Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.1085961Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.1086449Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:27.1086922Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.1087439Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:27.1088041Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:27.1088435Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:27.1088904Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.1089484Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.1090005Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.1090387Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:27.1090688Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.1091053Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.1091449Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.1091826Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.1092228Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.1092636Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.1093052Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.1093462Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.1093798Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.1094163Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.1094530Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.1094865Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.1095184Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.1095513Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.1095829Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.1096146Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:27.1096487Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.1096840Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.1097167Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:27.1097498Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.1097831Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.1098170Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:27.1098507Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.1098837Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:27.1099257Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.1099586Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.1099925Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.1100233Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.1100565Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.1100927Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.1101270Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:27.1101598Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.1101945Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.1102285Z parm:           rm_firmware_active:charp
2025-05-07T20:23:27.1102587Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:27.1102838Z ++ command -v nvidia-smi
2025-05-07T20:23:27.1103101Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:27.1103355Z + set +e
2025-05-07T20:23:27.1103667Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:27.1320370Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:27.1320666Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.1320900Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:27.1321684Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:27.1321954Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:27.1323131Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:27.1323931Z + set -e
2025-05-07T20:23:27.1324139Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:27.1324525Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:27.1324988Z + post_install_nvidia_driver_common
2025-05-07T20:23:27.1327545Z + sudo modprobe nvidia
2025-05-07T20:23:27.2635115Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:27.2635479Z + lspci
2025-05-07T20:23:27.2635704Z After installing NVIDIA driver
2025-05-07T20:23:27.2751532Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.2752060Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.2752597Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.2753192Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.2753858Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.2754610Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.2755096Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.2755561Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.2755956Z + lsmod
2025-05-07T20:23:27.2784713Z Module                  Size  Used by
2025-05-07T20:23:27.2784999Z veth                   36864  0
2025-05-07T20:23:27.2785262Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.2785541Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.2786046Z wmi                    36864  1 video
2025-05-07T20:23:27.2786570Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.2787150Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.2787790Z drm                   602112  1 nvidia
2025-05-07T20:23:27.2788371Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.2789186Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.2789864Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.2790419Z xt_conntrack           16384  1
2025-05-07T20:23:27.2790927Z nft_chain_nat          16384  3
2025-05-07T20:23:27.2791436Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.2792015Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.2792661Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.2793438Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.2794286Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.2795308Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.2795892Z xfrm_user              57344  1
2025-05-07T20:23:27.2796213Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.2796491Z xt_addrtype            16384  2
2025-05-07T20:23:27.2796757Z nft_compat             20480  4
2025-05-07T20:23:27.2797063Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.2797473Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.2797836Z br_netfilter           36864  0
2025-05-07T20:23:27.2798112Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.2798405Z stp                    16384  1 bridge
2025-05-07T20:23:27.2798693Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.2798978Z overlay               167936  0
2025-05-07T20:23:27.2799227Z tls                   135168  0
2025-05-07T20:23:27.2799470Z nls_ascii              16384  1
2025-05-07T20:23:27.2799728Z nls_cp437              20480  1
2025-05-07T20:23:27.2799984Z vfat                   24576  1
2025-05-07T20:23:27.2800229Z fat                    86016  1 vfat
2025-05-07T20:23:27.2800495Z sunrpc                696320  1
2025-05-07T20:23:27.2800742Z ena                   180224  0
2025-05-07T20:23:27.2800976Z i8042                  45056  0
2025-05-07T20:23:27.2801227Z serio                  28672  3 i8042
2025-05-07T20:23:27.2801501Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.2801756Z button                 24576  0
2025-05-07T20:23:27.2802002Z sch_fq_codel           20480  17
2025-05-07T20:23:27.2802258Z dm_mod                188416  0
2025-05-07T20:23:27.2802503Z fuse                  163840  1
2025-05-07T20:23:27.2802741Z loop                   36864  0
2025-05-07T20:23:27.2803153Z configfs               57344  1
2025-05-07T20:23:27.2803404Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.2803669Z dmi_sysfs              20480  0
2025-05-07T20:23:27.2803921Z crc32_pclmul           16384  0
2025-05-07T20:23:27.2804181Z crc32c_intel           24576  0
2025-05-07T20:23:27.2804427Z efivarfs               24576  1
2025-05-07T20:23:27.2804677Z + modinfo nvidia
2025-05-07T20:23:27.2807932Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.2808388Z import_ns:      DMA_BUF
2025-05-07T20:23:27.2808634Z alias:          char-major-195-*
2025-05-07T20:23:27.2808901Z version:        570.133.07
2025-05-07T20:23:27.2809144Z supported:      external
2025-05-07T20:23:27.2809388Z license:        Dual MIT/GPL
2025-05-07T20:23:27.2809672Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.2810007Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.2810325Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:27.2810640Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.2810976Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.2811300Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.2811607Z depends:        i2c-core,drm
2025-05-07T20:23:27.2811864Z retpoline:      Y
2025-05-07T20:23:27.2812083Z name:           nvidia
2025-05-07T20:23:27.2812435Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.2812900Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.2813333Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.2813747Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.2814050Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:27.2814356Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.2814668Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:27.2814969Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:27.2815274Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:27.2815635Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.2816012Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.2816452Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.2816759Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:27.2817057Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.2817424Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.2817822Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.2818196Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.2818595Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2818998Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.2819413Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2819814Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.2820150Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.2820512Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.2820875Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.2821208Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.2821525Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.2821852Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.2822164Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.2822474Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:27.2822817Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.2823165Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.2823492Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:27.2823827Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.2824159Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.2824591Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:27.2824929Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.2825261Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:27.2825546Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.2825871Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.2826190Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.2826497Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.2826829Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.2827179Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.2827518Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:27.2827840Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.2828365Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.2828701Z parm:           rm_firmware_active:charp
2025-05-07T20:23:27.2828987Z + set +e
2025-05-07T20:23:27.2829241Z + nvidia-smi
2025-05-07T20:23:27.2984136Z Wed May  7 20:23:27 2025
2025-05-07T20:23:27.2984506Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2985015Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:27.2985505Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2986052Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:27.2986579Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:27.2987013Z |                                         |                        |               MIG M. |
2025-05-07T20:23:27.2987348Z |=========================================+========================+======================|
2025-05-07T20:23:27.3119323Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:27.3119791Z |  0%   28C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:27.3120185Z |                                         |                        |                  N/A |
2025-05-07T20:23:27.3120756Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.3124079Z
2025-05-07T20:23:27.3124537Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.3125034Z | Processes:                                                                              |
2025-05-07T20:23:27.3125539Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:27.3126053Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:27.3126476Z |=========================================================================================|
2025-05-07T20:23:27.3130878Z |  No running processes found                                                             |
2025-05-07T20:23:27.3131363Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.5708978Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:27.5877319Z NVIDIA A10G
2025-05-07T20:23:27.5920020Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.5921661Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:27.5922116Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:27.5922423Z + set -e
2025-05-07T20:23:27.5922628Z INFO: Ignoring allowed status 0
2025-05-07T20:23:27.5930726Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:27.5943778Z + sudo yum install -y yum-utils
2025-05-07T20:23:28.0299734Z Last metadata expiration check: 0:10:02 ago on Wed May  7 20:13:26 2025.
2025-05-07T20:23:28.0548711Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:28.0951977Z Dependencies resolved.
2025-05-07T20:23:28.1136534Z Nothing to do.
2025-05-07T20:23:28.1136900Z Complete!
2025-05-07T20:23:28.1541228Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:28.1541867Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.1542712Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.5739873Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.6318724Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:29.1621790Z nvidia-container-toolkit                         13 kB/s | 833  B     00:00
2025-05-07T20:23:29.1869263Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:29.1874789Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:23:29.2269241Z Dependencies resolved.
2025-05-07T20:23:29.2450943Z Nothing to do.
2025-05-07T20:23:29.2451386Z Complete!
2025-05-07T20:23:29.2855044Z + sudo systemctl restart docker
2025-05-07T20:23:31.6697083Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:31.6893459Z Wed May  7 20:23:31 2025
2025-05-07T20:23:31.6894172Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:31.6895076Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:31.6895961Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:31.6896856Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:31.6897797Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:31.6898373Z |                                         |                        |               MIG M. |
2025-05-07T20:23:31.6898724Z |=========================================+========================+======================|
2025-05-07T20:23:31.7030002Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:31.7030450Z |  0%   29C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:31.7030819Z |                                         |                        |                  N/A |
2025-05-07T20:23:31.7031210Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:31.7034338Z
2025-05-07T20:23:31.7034735Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:31.7035162Z | Processes:                                                                              |
2025-05-07T20:23:31.7035601Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:31.7036027Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:31.7036370Z |=========================================================================================|
2025-05-07T20:23:31.7040891Z |  No running processes found                                                             |
2025-05-07T20:23:31.7041363Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:32.8278948Z Command completed after 1 attempt(s).
2025-05-07T20:23:32.8366963Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:32.8367441Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:32.8380633Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:32.8381175Z env:
2025-05-07T20:23:32.8381408Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:32.8381710Z   BUILD_ENV: build_binary
2025-05-07T20:23:32.8381962Z   BUILD_TARGET: genai
2025-05-07T20:23:32.8382204Z   BUILD_VARIANT: cuda
2025-05-07T20:23:32.8382437Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:32.8382694Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:32.8382995Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:32.8383322Z ##[endgroup]
2025-05-07T20:23:33.1766328Z ################################################################################
2025-05-07T20:23:33.1766708Z # Print System Info
2025-05-07T20:23:33.1766930Z #
2025-05-07T20:23:33.1782630Z # [2025-05-07T20:23:33.177Z] + print_system_info
2025-05-07T20:23:33.1783001Z ################################################################################
2025-05-07T20:23:33.1783219Z
2025-05-07T20:23:33.1783337Z ################################################################################
2025-05-07T20:23:33.1783673Z [INFO] Printing environment variables ...
2025-05-07T20:23:33.1783982Z + printenv
2025-05-07T20:23:33.1784098Z
2025-05-07T20:23:33.1794419Z SHELL=/bin/bash
2025-05-07T20:23:33.1794941Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:33.1795497Z BUILD_VARIANT=cuda
2025-05-07T20:23:33.1796219Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1797010Z GITHUB_ACTION=__run
2025-05-07T20:23:33.1797394Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:33.1797851Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:33.1798167Z RUNNER_NAME=i-0e56304501e4f5200
2025-05-07T20:23:33.1798445Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:33.1798749Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:33.1799009Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:33.1799363Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:33.1799779Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:33.1800062Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:33.1800339Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:33.1801044Z ***
2025-05-07T20:23:33.1801239Z LOGNAME=ec2-user
2025-05-07T20:23:33.1801475Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:33.1801725Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:33.1801951Z GITHUB_ACTIONS=true
2025-05-07T20:23:33.1802170Z SYSTEMD_EXEC_PID=55534
2025-05-07T20:23:33.1802436Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:33.1802969Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:33.1803469Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:33.1803750Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:33.1803999Z RUNNER_OS=Linux
2025-05-07T20:23:33.1804220Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:33.1804465Z HOME=/home/ec2-user
2025-05-07T20:23:33.1804706Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:33.1804990Z LANG=C.UTF-8
2025-05-07T20:23:33.1805288Z RUNNER_TRACKING_ID=github_bf3ce286-0e7f-4ee1-994d-9126ade0d35d
2025-05-07T20:23:33.1805635Z RUNNER_ARCH=X64
2025-05-07T20:23:33.1805907Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:33.1806234Z BUILD_TARGET=genai
2025-05-07T20:23:33.1806743Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1807584Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1808303Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:33.1809152Z INVOCATION_ID=92df7f3866bb4d08acaa1a9054d7e53b
2025-05-07T20:23:33.1809474Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:33.1809739Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:33.1810311Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1811069Z BUILD_ENV=build_binary
2025-05-07T20:23:33.1811298Z GITHUB_ACTOR=q10
2025-05-07T20:23:33.1811514Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:33.1811753Z KERN_NAME_LC=linux
2025-05-07T20:23:33.1811969Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:33.1812266Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:33.1812610Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:33.1812855Z USER=ec2-user
2025-05-07T20:23:33.1813078Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:33.1813354Z SHLVL=1
2025-05-07T20:23:33.1813550Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:33.1813848Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:33.1814285Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:33.1814647Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:33.1814878Z KERN_NAME=Linux
2025-05-07T20:23:33.1815108Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:33.1815505Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:33.1815924Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:33.1816199Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:33.1816445Z JOURNAL_STREAM=8:82680
2025-05-07T20:23:33.1816747Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:33.1817162Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:33.1817466Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:33.1817787Z GITHUB_BASE_REF=main
2025-05-07T20:23:33.1817997Z CI=true
2025-05-07T20:23:33.1818233Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:33.1818632Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:33.1819018Z GITHUB_ACTION_REF=
2025-05-07T20:23:33.1819354Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:33.1820093Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_059f0104-fe17-4e08-a0e5-9395de160e8b
2025-05-07T20:23:33.1820661Z MACHINE_NAME=x86_64
2025-05-07T20:23:33.1820883Z _=/usr/bin/printenv
2025-05-07T20:23:33.1821019Z
2025-05-07T20:23:33.1821144Z ################################################################################
2025-05-07T20:23:33.1821569Z [INFO] Print ldd version ...
2025-05-07T20:23:33.1821923Z + ldd --version
2025-05-07T20:23:33.1822104Z
2025-05-07T20:23:33.1822235Z ldd (GNU libc) 2.34
2025-05-07T20:23:33.1822599Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:33.1823104Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:33.1823628Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:33.1824067Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:33.1824281Z
2025-05-07T20:23:33.1824404Z ################################################################################
2025-05-07T20:23:33.1824707Z [INFO] Print CPU info ...
2025-05-07T20:23:33.1824945Z + nproc
2025-05-07T20:23:33.1825055Z
2025-05-07T20:23:33.1833781Z 16
2025-05-07T20:23:33.1835518Z
2025-05-07T20:23:33.1835733Z + lscpu
2025-05-07T20:23:33.1835915Z
2025-05-07T20:23:33.1905740Z Architecture:                         x86_64
2025-05-07T20:23:33.1906229Z CPU op-mode(s):                       32-bit, 64-bit
2025-05-07T20:23:33.1907228Z Address sizes:                        48 bits physical, 48 bits virtual
2025-05-07T20:23:33.1908076Z Byte Order:                           Little Endian
2025-05-07T20:23:33.1908431Z CPU(s):                               16
2025-05-07T20:23:33.1908730Z On-line CPU(s) list:                  0-15
2025-05-07T20:23:33.1909119Z Vendor ID:                            AuthenticAMD
2025-05-07T20:23:33.1909482Z Model name:                           AMD EPYC 7R32
2025-05-07T20:23:33.1909790Z CPU family:                           23
2025-05-07T20:23:33.1910301Z Model:                                49
2025-05-07T20:23:33.1910594Z Thread(s) per core:                   2
2025-05-07T20:23:33.1910875Z Core(s) per socket:                   8
2025-05-07T20:23:33.1911161Z Socket(s):                            1
2025-05-07T20:23:33.1911564Z Stepping:                             0
2025-05-07T20:23:33.1911851Z BogoMIPS:                             5599.99
2025-05-07T20:23:33.1913905Z Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1915954Z Hypervisor vendor:                    KVM
2025-05-07T20:23:33.1916260Z Virtualization type:                  full
2025-05-07T20:23:33.1916590Z L1d cache:                            256 KiB (8 instances)
2025-05-07T20:23:33.1916951Z L1i cache:                            256 KiB (8 instances)
2025-05-07T20:23:33.1917314Z L2 cache:                             4 MiB (8 instances)
2025-05-07T20:23:33.1917663Z L3 cache:                             32 MiB (2 instances)
2025-05-07T20:23:33.1918009Z NUMA node(s):                         1
2025-05-07T20:23:33.1918316Z NUMA node0 CPU(s):                    0-15
2025-05-07T20:23:33.1918648Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:33.1919010Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:33.1919352Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:33.1919698Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:33.1920048Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:33.1920390Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:33.1920777Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:33.1921312Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:33.1922005Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:33.1922695Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:33.1923435Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:33.1924435Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:33.1925122Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:33.1925480Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:33.1925797Z
2025-05-07T20:23:33.1925891Z + cat /proc/cpuinfo
2025-05-07T20:23:33.1926027Z
2025-05-07T20:23:33.1926213Z processor       : 0
2025-05-07T20:23:33.1926427Z vendor_id       : AuthenticAMD
2025-05-07T20:23:33.1926669Z cpu family      : 23
2025-05-07T20:23:33.1926885Z model           : 49
2025-05-07T20:23:33.1927089Z model name      : AMD EPYC 7R32
2025-05-07T20:23:33.1927339Z stepping        : 0
2025-05-07T20:23:33.1927553Z microcode       : 0x830107f
2025-05-07T20:23:33.1927806Z cpu MHz         : 3299.302
2025-05-07T20:23:33.1928050Z cache size      : 512 KB
2025-05-07T20:23:33.1928539Z physical id     : 0
2025-05-07T20:23:33.1928745Z siblings        : 16
2025-05-07T20:23:33.1928946Z core id         : 0
2025-05-07T20:23:33.1929142Z cpu cores       : 8
2025-05-07T20:23:33.1929340Z apicid          : 0
2025-05-07T20:23:33.1929530Z initial apicid  : 0
2025-05-07T20:23:33.1929739Z fpu             : yes
2025-05-07T20:23:33.1929940Z fpu_exception   : yes
2025-05-07T20:23:33.1930149Z cpuid level     : 13
2025-05-07T20:23:33.1930352Z wp              : yes
2025-05-07T20:23:33.1932474Z flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1934827Z bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:33.1935305Z bogomips        : 5599.99
2025-05-07T20:23:33.1935524Z TLB size        : 3072 4K pages
2025-05-07T20:23:33.1935764Z clflush size    : 64
2025-05-07T20:23:33.1935978Z cache_alignment : 64
2025-05-07T20:23:33.1936246Z address sizes   : 48 bits physical, 48 bits virtual
2025-05-07T20:23:33.1936565Z power management:
2025-05-07T20:23:33.1936694Z
2025-05-07T20:23:33.1936783Z processor       : 1
2025-05-07T20:23:33.1936991Z vendor_id       : AuthenticAMD
2025-05-07T20:23:33.1937225Z cpu family      : 23
2025-05-07T20:23:33.1937428Z model           : 49
2025-05-07T20:23:33.1937627Z model name      : AMD EPYC 7R32
2025-05-07T20:23:33.1937871Z stepping        : 0
2025-05-07T20:23:33.1938078Z microcode       : 0x830107f
2025-05-07T20:23:33.1938295Z cpu MHz         : 3191.328
2025-05-07T20:23:33.1938506Z cache size      : 512 KB
2025-05-07T20:23:33.1938717Z physical id     : 0
2025-05-07T20:23:33.1938916Z siblings        : 16
2025-05-07T20:23:33.1939116Z core id         : 1
2025-05-07T20:23:33.1939312Z cpu cores       : 8
2025-05-07T20:23:33.1939505Z apicid          : 2
2025-05-07T20:23:33.1939701Z initial apicid  : 2
2025-05-07T20:23:33.1939913Z fpu             : yes
2025-05-07T20:23:33.1940107Z fpu_exception   : yes
2025-05-07T20:23:33.1940324Z cpuid level     : 13
2025-05-07T20:23:33.1940529Z wp              : yes
2025-05-07T20:23:33.1942449Z flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:33.1944634Z bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:33.1945123Z bogomips        : 5599.99
2025-05-07T20:23:33.1945345Z TLB size        : 3072 4K pages
2025-05-07T20:23:33.1945580Z clflush size    : 64
2025-05-07T20:23:33.1945793Z cache_alignment : 64 2025-05-07T20:23:33.1946072Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1946390Z power management: 2025-05-07T20:23:33.1946523Z 2025-05-07T20:23:33.1946609Z processor : 2 2025-05-07T20:23:33.1946827Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.1947069Z cpu family : 23 2025-05-07T20:23:33.1947277Z model : 49 2025-05-07T20:23:33.1947487Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.1947731Z stepping : 0 2025-05-07T20:23:33.1947931Z microcode : 0x830107f 2025-05-07T20:23:33.1948164Z cpu MHz : 3299.688 2025-05-07T20:23:33.1948380Z cache size : 512 KB 2025-05-07T20:23:33.1948585Z physical id : 0 2025-05-07T20:23:33.1948791Z siblings : 16 2025-05-07T20:23:33.1948993Z core id : 2 2025-05-07T20:23:33.1949261Z cpu cores : 8 2025-05-07T20:23:33.1949460Z apicid : 4 2025-05-07T20:23:33.1949657Z initial apicid : 4 2025-05-07T20:23:33.1949860Z fpu : yes 2025-05-07T20:23:33.1950061Z fpu_exception : yes 2025-05-07T20:23:33.1950277Z cpuid level : 13 2025-05-07T20:23:33.1950475Z wp : yes 2025-05-07T20:23:33.1952478Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.1954721Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.1955205Z bogomips : 5599.99 2025-05-07T20:23:33.1955423Z TLB size : 3072 4K pages 2025-05-07T20:23:33.1955648Z clflush size : 64 2025-05-07T20:23:33.1955860Z cache_alignment : 64 2025-05-07T20:23:33.1956129Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1956432Z power management: 2025-05-07T20:23:33.1956568Z 2025-05-07T20:23:33.1956649Z processor : 3 2025-05-07T20:23:33.1956864Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.1957101Z cpu family : 23 2025-05-07T20:23:33.1957302Z model : 49 2025-05-07T20:23:33.1957510Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.1957748Z stepping : 0 2025-05-07T20:23:33.1957985Z microcode : 0x830107f 2025-05-07T20:23:33.1958232Z cpu MHz : 3300.234 2025-05-07T20:23:33.1958450Z cache size : 512 KB 2025-05-07T20:23:33.1958665Z physical id : 0 2025-05-07T20:23:33.1958873Z siblings : 16 2025-05-07T20:23:33.1959070Z core id : 3 2025-05-07T20:23:33.1959269Z cpu cores : 8 2025-05-07T20:23:33.1959467Z apicid : 6 2025-05-07T20:23:33.1959664Z initial apicid : 6 2025-05-07T20:23:33.1959872Z fpu : yes 2025-05-07T20:23:33.1960074Z fpu_exception : yes 2025-05-07T20:23:33.1960297Z cpuid level : 13 2025-05-07T20:23:33.1960498Z wp : yes 2025-05-07T20:23:33.1962419Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.1964603Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.1965086Z bogomips : 5599.99 2025-05-07T20:23:33.1965309Z TLB size : 3072 4K pages 2025-05-07T20:23:33.1965536Z clflush size : 64 2025-05-07T20:23:33.1965754Z cache_alignment : 64 2025-05-07T20:23:33.1966024Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.1966328Z power management: 2025-05-07T20:23:33.2015278Z 2025-05-07T20:23:33.2015406Z processor : 4 2025-05-07T20:23:33.2015654Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2015947Z cpu family : 23 2025-05-07T20:23:33.2016186Z model : 49 2025-05-07T20:23:33.2016440Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2016680Z stepping : 0 2025-05-07T20:23:33.2016915Z microcode : 0x830107f 2025-05-07T20:23:33.2017170Z cpu MHz : 3288.514 2025-05-07T20:23:33.2017388Z cache size : 512 KB 2025-05-07T20:23:33.2017599Z physical id : 0 2025-05-07T20:23:33.2017816Z siblings : 16 2025-05-07T20:23:33.2018021Z core id : 4 2025-05-07T20:23:33.2018218Z cpu cores : 8 2025-05-07T20:23:33.2018420Z apicid : 8 2025-05-07T20:23:33.2018619Z initial apicid : 8 2025-05-07T20:23:33.2018827Z fpu : yes 2025-05-07T20:23:33.2019088Z fpu_exception : yes 2025-05-07T20:23:33.2019310Z cpuid level : 13 2025-05-07T20:23:33.2019522Z wp : yes 2025-05-07T20:23:33.2021614Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2023894Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2024377Z bogomips : 5599.99 2025-05-07T20:23:33.2024599Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2024825Z clflush size : 64 2025-05-07T20:23:33.2025043Z cache_alignment : 64 2025-05-07T20:23:33.2025310Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2025618Z power management: 2025-05-07T20:23:33.2025755Z 2025-05-07T20:23:33.2025840Z processor : 5 2025-05-07T20:23:33.2026056Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2026291Z cpu family : 23 2025-05-07T20:23:33.2026490Z model : 49 2025-05-07T20:23:33.2026698Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2026945Z stepping : 0 2025-05-07T20:23:33.2027146Z microcode : 0x830107f 2025-05-07T20:23:33.2027367Z cpu MHz : 3287.096 2025-05-07T20:23:33.2027588Z cache size : 512 KB 2025-05-07T20:23:33.2027795Z physical id : 0 2025-05-07T20:23:33.2027997Z siblings : 16 2025-05-07T20:23:33.2028467Z core id : 5 2025-05-07T20:23:33.2028720Z cpu cores : 8 2025-05-07T20:23:33.2028922Z apicid : 10 2025-05-07T20:23:33.2029168Z initial apicid : 10 2025-05-07T20:23:33.2029377Z fpu : yes 2025-05-07T20:23:33.2029580Z fpu_exception : yes 2025-05-07T20:23:33.2029796Z cpuid level : 13 2025-05-07T20:23:33.2030002Z wp : yes 2025-05-07T20:23:33.2031916Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2034092Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2034576Z bogomips : 5599.99 2025-05-07T20:23:33.2034798Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2035028Z clflush size : 64 2025-05-07T20:23:33.2035247Z cache_alignment : 64 2025-05-07T20:23:33.2035513Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2035822Z power management: 2025-05-07T20:23:33.2035960Z 2025-05-07T20:23:33.2036044Z processor : 6 2025-05-07T20:23:33.2036263Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2036497Z cpu family : 23 2025-05-07T20:23:33.2036706Z model : 49 2025-05-07T20:23:33.2036912Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2037146Z stepping : 0 2025-05-07T20:23:33.2037359Z microcode : 0x830107f 2025-05-07T20:23:33.2037591Z cpu MHz : 3314.091 2025-05-07T20:23:33.2037799Z cache size : 512 KB 2025-05-07T20:23:33.2038014Z physical id : 0 2025-05-07T20:23:33.2038220Z siblings : 16 2025-05-07T20:23:33.2038417Z core id : 6 2025-05-07T20:23:33.2038621Z cpu cores : 8 2025-05-07T20:23:33.2038826Z apicid : 12 2025-05-07T20:23:33.2039032Z initial apicid : 12 2025-05-07T20:23:33.2039245Z fpu : yes 2025-05-07T20:23:33.2039447Z fpu_exception : yes 2025-05-07T20:23:33.2039657Z cpuid level : 13 2025-05-07T20:23:33.2039867Z wp : yes 2025-05-07T20:23:33.2041978Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2044177Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2044666Z bogomips : 5599.99 2025-05-07T20:23:33.2045007Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2045241Z clflush size : 64 2025-05-07T20:23:33.2045456Z cache_alignment : 64 2025-05-07T20:23:33.2045715Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2046034Z power management: 2025-05-07T20:23:33.2046162Z 2025-05-07T20:23:33.2046254Z processor : 7 2025-05-07T20:23:33.2046464Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2046704Z cpu family : 23 2025-05-07T20:23:33.2046911Z model : 49 2025-05-07T20:23:33.2047111Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2047347Z stepping : 0 2025-05-07T20:23:33.2047559Z microcode : 0x830107f 2025-05-07T20:23:33.2047780Z cpu MHz : 3305.442 2025-05-07T20:23:33.2048001Z cache size : 512 KB 2025-05-07T20:23:33.2048220Z physical id : 0 2025-05-07T20:23:33.2048477Z siblings : 16 2025-05-07T20:23:33.2048755Z core id : 7 2025-05-07T20:23:33.2049016Z cpu cores : 8 2025-05-07T20:23:33.2049275Z apicid : 
14 2025-05-07T20:23:33.2049533Z initial apicid : 14 2025-05-07T20:23:33.2049763Z fpu : yes 2025-05-07T20:23:33.2049972Z fpu_exception : yes 2025-05-07T20:23:33.2050183Z cpuid level : 13 2025-05-07T20:23:33.2050391Z wp : yes 2025-05-07T20:23:33.2052308Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2054488Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2054962Z bogomips : 5599.99 2025-05-07T20:23:33.2055186Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2055422Z clflush size : 64 2025-05-07T20:23:33.2055634Z cache_alignment : 64 2025-05-07T20:23:33.2055909Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2056226Z power management: 2025-05-07T20:23:33.2056354Z 2025-05-07T20:23:33.2056447Z processor : 8 2025-05-07T20:23:33.2056658Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2056894Z cpu family : 23 2025-05-07T20:23:33.2057106Z model : 49 2025-05-07T20:23:33.2057305Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2057545Z stepping : 0 2025-05-07T20:23:33.2057757Z microcode : 0x830107f 2025-05-07T20:23:33.2057977Z cpu MHz : 3278.423 2025-05-07T20:23:33.2058213Z cache size : 512 KB 2025-05-07T20:23:33.2058455Z physical id : 0 2025-05-07T20:23:33.2058660Z siblings : 16 2025-05-07T20:23:33.2058863Z core id : 0 2025-05-07T20:23:33.2059059Z cpu cores : 8 2025-05-07T20:23:33.2059252Z apicid : 1 2025-05-07T20:23:33.2059446Z initial apicid : 1 2025-05-07T20:23:33.2059656Z fpu : yes 2025-05-07T20:23:33.2059848Z fpu_exception : yes 2025-05-07T20:23:33.2060068Z cpuid level : 13 2025-05-07T20:23:33.2060278Z wp : yes 2025-05-07T20:23:33.2062186Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2064715Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2065199Z bogomips : 5599.99 2025-05-07T20:23:33.2065418Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2065661Z clflush size : 64 2025-05-07T20:23:33.2065871Z cache_alignment : 64 2025-05-07T20:23:33.2066215Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2066531Z power management: 2025-05-07T20:23:33.2066660Z 2025-05-07T20:23:33.2066751Z processor : 9 2025-05-07T20:23:33.2066961Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2067197Z cpu family : 23 2025-05-07T20:23:33.2067404Z model : 49 2025-05-07T20:23:33.2067601Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2067840Z 
stepping : 0 2025-05-07T20:23:33.2068044Z microcode : 0x830107f 2025-05-07T20:23:33.2068261Z cpu MHz : 3293.563 2025-05-07T20:23:33.2068474Z cache size : 512 KB 2025-05-07T20:23:33.2068687Z physical id : 0 2025-05-07T20:23:33.2068889Z siblings : 16 2025-05-07T20:23:33.2069160Z core id : 1 2025-05-07T20:23:33.2069362Z cpu cores : 8 2025-05-07T20:23:33.2069552Z apicid : 3 2025-05-07T20:23:33.2069756Z initial apicid : 3 2025-05-07T20:23:33.2069965Z fpu : yes 2025-05-07T20:23:33.2070159Z fpu_exception : yes 2025-05-07T20:23:33.2070371Z cpuid level : 13 2025-05-07T20:23:33.2070574Z wp : yes 2025-05-07T20:23:33.2072476Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2074647Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2075125Z bogomips : 5599.99 2025-05-07T20:23:33.2075346Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2075581Z clflush size : 64 2025-05-07T20:23:33.2075790Z cache_alignment : 64 2025-05-07T20:23:33.2076052Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2076427Z power management: 2025-05-07T20:23:33.2076614Z 2025-05-07T20:23:33.2076709Z processor : 10 2025-05-07T20:23:33.2076990Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2077227Z cpu family : 23 2025-05-07T20:23:33.2077423Z model : 49 2025-05-07T20:23:33.2077625Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2077858Z stepping : 0 2025-05-07T20:23:33.2078057Z microcode : 0x830107f 2025-05-07T20:23:33.2078279Z cpu MHz : 3299.089 2025-05-07T20:23:33.2078492Z cache size : 512 KB 2025-05-07T20:23:33.2078699Z physical id : 0 2025-05-07T20:23:33.2078903Z siblings : 16 2025-05-07T20:23:33.2079102Z core id : 2 2025-05-07T20:23:33.2079291Z cpu cores : 8 2025-05-07T20:23:33.2079490Z apicid : 5 2025-05-07T20:23:33.2079689Z initial apicid : 5 2025-05-07T20:23:33.2079895Z fpu : yes 2025-05-07T20:23:33.2080090Z fpu_exception : yes 2025-05-07T20:23:33.2080306Z cpuid level : 13 2025-05-07T20:23:33.2080505Z wp : yes 2025-05-07T20:23:33.2082405Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2084580Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2085057Z bogomips : 5599.99 2025-05-07T20:23:33.2085379Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2085610Z clflush size : 64 2025-05-07T20:23:33.2085823Z cache_alignment : 64 2025-05-07T20:23:33.2086095Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:33.2086397Z power management: 2025-05-07T20:23:33.2086607Z 2025-05-07T20:23:33.2086691Z processor : 11 2025-05-07T20:23:33.2086915Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2087143Z cpu family : 23 2025-05-07T20:23:33.2087343Z model : 49 2025-05-07T20:23:33.2087550Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2087784Z stepping : 0 2025-05-07T20:23:33.2087996Z microcode : 0x830107f 2025-05-07T20:23:33.2088250Z cpu MHz : 3297.070 2025-05-07T20:23:33.2088480Z cache size : 512 KB 2025-05-07T20:23:33.2088701Z physical id : 0 2025-05-07T20:23:33.2088909Z siblings : 16 2025-05-07T20:23:33.2089108Z core id : 3 2025-05-07T20:23:33.2089310Z cpu cores : 8 2025-05-07T20:23:33.2089515Z apicid : 7 2025-05-07T20:23:33.2089707Z initial apicid : 7 2025-05-07T20:23:33.2089969Z fpu : yes 2025-05-07T20:23:33.2090246Z fpu_exception : yes 2025-05-07T20:23:33.2090511Z cpuid level : 13 2025-05-07T20:23:33.2090731Z wp : yes 2025-05-07T20:23:33.2092655Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2094857Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2095337Z bogomips : 5599.99 2025-05-07T20:23:33.2095549Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2095789Z clflush size : 64 2025-05-07T20:23:33.2096004Z cache_alignment : 64 2025-05-07T20:23:33.2096263Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2096584Z power management: 2025-05-07T20:23:33.2096713Z 2025-05-07T20:23:33.2096804Z processor : 12 2025-05-07T20:23:33.2097019Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2097256Z cpu family : 23 2025-05-07T20:23:33.2097464Z model : 49 2025-05-07T20:23:33.2097664Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2097905Z stepping : 0 2025-05-07T20:23:33.2098114Z microcode : 0x830107f 2025-05-07T20:23:33.2098332Z cpu MHz : 3288.719 2025-05-07T20:23:33.2098548Z cache size : 512 KB 2025-05-07T20:23:33.2098775Z physical id : 0 2025-05-07T20:23:33.2098976Z siblings : 16 2025-05-07T20:23:33.2099177Z core id : 4 2025-05-07T20:23:33.2099382Z cpu cores : 8 2025-05-07T20:23:33.2099580Z apicid : 9 2025-05-07T20:23:33.2099779Z initial apicid : 9 2025-05-07T20:23:33.2100002Z fpu : yes 2025-05-07T20:23:33.2100208Z fpu_exception : yes 2025-05-07T20:23:33.2100427Z cpuid level : 13 2025-05-07T20:23:33.2100644Z wp : yes 2025-05-07T20:23:33.2102570Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:33.2104990Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2105474Z bogomips : 5599.99 2025-05-07T20:23:33.2105703Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2105941Z clflush size : 64 2025-05-07T20:23:33.2106151Z cache_alignment : 64 2025-05-07T20:23:33.2106534Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2106853Z power management: 2025-05-07T20:23:33.2106983Z 2025-05-07T20:23:33.2107078Z processor : 13 2025-05-07T20:23:33.2107291Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2107529Z cpu family : 23 2025-05-07T20:23:33.2107816Z model : 49 2025-05-07T20:23:33.2108044Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2108308Z stepping : 0 2025-05-07T20:23:33.2108516Z microcode : 0x830107f 2025-05-07T20:23:33.2108733Z cpu MHz : 3291.493 2025-05-07T20:23:33.2108952Z cache size : 512 KB 2025-05-07T20:23:33.2109258Z physical id : 0 2025-05-07T20:23:33.2109461Z siblings : 16 2025-05-07T20:23:33.2109662Z core id : 5 2025-05-07T20:23:33.2109867Z cpu cores : 8 2025-05-07T20:23:33.2110061Z apicid : 11 2025-05-07T20:23:33.2110265Z initial apicid : 11 2025-05-07T20:23:33.2110483Z fpu : yes 2025-05-07T20:23:33.2110683Z fpu_exception : yes 2025-05-07T20:23:33.2110899Z cpuid level : 13 2025-05-07T20:23:33.2111104Z wp : yes 2025-05-07T20:23:33.2113025Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2115223Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2115708Z bogomips : 5599.99 2025-05-07T20:23:33.2115929Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2116172Z clflush size : 64 2025-05-07T20:23:33.2116384Z cache_alignment : 64 2025-05-07T20:23:33.2116653Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2116969Z power management: 2025-05-07T20:23:33.2117102Z 2025-05-07T20:23:33.2117186Z processor : 14 2025-05-07T20:23:33.2117404Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2117641Z cpu family : 23 2025-05-07T20:23:33.2117846Z model : 49 2025-05-07T20:23:33.2118086Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2118353Z stepping : 0 2025-05-07T20:23:33.2118560Z microcode : 0x830107f 2025-05-07T20:23:33.2118798Z cpu MHz : 3285.188 2025-05-07T20:23:33.2119012Z cache size : 512 KB 2025-05-07T20:23:33.2119222Z physical id : 0 2025-05-07T20:23:33.2119430Z siblings : 16 2025-05-07T20:23:33.2119630Z core id : 6 2025-05-07T20:23:33.2119824Z cpu cores : 8 2025-05-07T20:23:33.2120028Z apicid : 13 2025-05-07T20:23:33.2120238Z initial apicid : 13 2025-05-07T20:23:33.2120455Z fpu : yes 2025-05-07T20:23:33.2120661Z fpu_exception : yes 2025-05-07T20:23:33.2120881Z cpuid level : 13 2025-05-07T20:23:33.2121081Z wp : yes 2025-05-07T20:23:33.2123014Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2125208Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2125695Z bogomips : 5599.99 2025-05-07T20:23:33.2125920Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2126150Z clflush size : 64 2025-05-07T20:23:33.2126369Z cache_alignment : 64 2025-05-07T20:23:33.2126637Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2126944Z power management: 2025-05-07T20:23:33.2127083Z 2025-05-07T20:23:33.2127271Z processor : 15 2025-05-07T20:23:33.2127492Z vendor_id : AuthenticAMD 2025-05-07T20:23:33.2127723Z cpu family : 23 2025-05-07T20:23:33.2127930Z model : 49 2025-05-07T20:23:33.2128419Z model name : AMD EPYC 7R32 2025-05-07T20:23:33.2128669Z stepping : 0 2025-05-07T20:23:33.2129024Z microcode : 0x830107f 2025-05-07T20:23:33.2129250Z cpu MHz : 3281.467 2025-05-07T20:23:33.2129458Z cache size : 512 KB 2025-05-07T20:23:33.2129671Z physical id : 0 2025-05-07T20:23:33.2129877Z siblings : 16 2025-05-07T20:23:33.2130075Z core id : 7 2025-05-07T20:23:33.2130271Z cpu cores : 8 2025-05-07T20:23:33.2130470Z apicid : 15 2025-05-07T20:23:33.2130677Z initial apicid : 15 2025-05-07T20:23:33.2130882Z fpu : yes 2025-05-07T20:23:33.2131081Z fpu_exception : yes 2025-05-07T20:23:33.2131300Z cpuid level : 13 2025-05-07T20:23:33.2131499Z wp : yes 2025-05-07T20:23:33.2133423Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:33.2135602Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:33.2136089Z bogomips : 5599.99 2025-05-07T20:23:33.2136301Z TLB size : 3072 4K pages 2025-05-07T20:23:33.2136537Z clflush size : 64 2025-05-07T20:23:33.2136750Z cache_alignment : 64 2025-05-07T20:23:33.2137009Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:33.2137324Z power management: 2025-05-07T20:23:33.2137461Z 2025-05-07T20:23:33.2137465Z 2025-05-07T20:23:33.2137591Z ################################################################################ 2025-05-07T20:23:33.2137927Z [INFO] Print PCI info ... 2025-05-07T20:23:33.2138185Z + lspci -v 2025-05-07T20:23:33.2138310Z 2025-05-07T20:23:33.2138527Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:33.2138915Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:33.2139240Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:33.2139444Z 2025-05-07T20:23:33.2139637Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:33.2140023Z Physical Slot: 1 2025-05-07T20:23:33.2140270Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2140471Z 2025-05-07T20:23:33.2140726Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:33.2141157Z Physical Slot: 1 2025-05-07T20:23:33.2141417Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:33.2141640Z 2025-05-07T20:23:33.2141917Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:33.2142354Z Physical Slot: 3 2025-05-07T20:23:33.2142603Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2142944Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:33.2143303Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:33.2143522Z 2025-05-07T20:23:33.2143824Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:33.2144331Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:33.2144623Z Physical Slot: 4 2025-05-07T20:23:33.2144876Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:33.2145260Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2145622Z Capabilities: 2025-05-07T20:23:33.2145899Z Kernel driver in use: nvme 2025-05-07T20:23:33.2146075Z 2025-05-07T20:23:33.2148266Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:33.2148761Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:33.2149176Z Physical Slot: 5 2025-05-07T20:23:33.2149413Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2149770Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2150234Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:33.2150551Z Capabilities: 2025-05-07T20:23:33.2150816Z Kernel driver in use: ena 2025-05-07T20:23:33.2151058Z Kernel modules: ena 2025-05-07T20:23:33.2151195Z 2025-05-07T20:23:33.2151364Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:33.2151742Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:33.2152034Z Physical Slot: 30 2025-05-07T20:23:33.2152296Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:33.2152665Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:33.2153060Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:33.2153429Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:33.2153749Z Capabilities: 2025-05-07T20:23:33.2154020Z Kernel driver in use: nvidia 2025-05-07T20:23:33.2154275Z Kernel modules: nvidia 2025-05-07T20:23:33.2154421Z 2025-05-07T20:23:33.2154721Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:33.2155230Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:33.2155521Z Physical Slot: 31 2025-05-07T20:23:33.2155766Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:33.2156114Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:33.2156492Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:33.2156816Z Capabilities: 2025-05-07T20:23:33.2157076Z Kernel driver in use: nvme 2025-05-07T20:23:33.2157246Z 2025-05-07T20:23:33.2157250Z 2025-05-07T20:23:33.2157367Z ################################################################################ 2025-05-07T20:23:33.2164221Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:33.2164544Z + uname -a 2025-05-07T20:23:33.2164668Z 2025-05-07T20:23:33.2165061Z Linux ip-10-0-66-0.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:33.2165555Z 2025-05-07T20:23:33.2165633Z + uname -m 2025-05-07T20:23:33.2165746Z 2025-05-07T20:23:33.2165829Z x86_64 2025-05-07T20:23:33.2165934Z 2025-05-07T20:23:33.2166018Z + cat /proc/version 2025-05-07T20:23:33.2166156Z 2025-05-07T20:23:33.2166689Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:33.2167311Z 2025-05-07T20:23:33.2167400Z + cat /etc/os-release 2025-05-07T20:23:33.2167541Z 2025-05-07T20:23:33.2167640Z NAME="Amazon Linux" 2025-05-07T20:23:33.2167847Z VERSION="2023" 2025-05-07T20:23:33.2168051Z ID="amzn" 2025-05-07T20:23:33.2168242Z ID_LIKE="fedora" 2025-05-07T20:23:33.2168443Z VERSION_ID="2023" 2025-05-07T20:23:33.2168679Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:33.2168960Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:33.2169246Z ANSI_COLOR="0;33" 2025-05-07T20:23:33.2169491Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:33.2169886Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:33.2170315Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:33.2170723Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:33.2171160Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:33.2171528Z VENDOR_NAME="AWS" 2025-05-07T20:23:33.2171763Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:33.2172054Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:33.2172203Z 2025-05-07T20:23:33.2172444Z ################################################################################ 2025-05-07T20:23:33.2172743Z # Print EC2 Instance Info 2025-05-07T20:23:33.2172977Z # 2025-05-07T20:23:33.2173184Z # [2025-05-07T20:23:33.211Z] + print_ec2_info 2025-05-07T20:23:33.2173495Z ################################################################################ 2025-05-07T20:23:33.2173783Z 2025-05-07T20:23:33.2239556Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:33.2374878Z instance-id: i-0e56304501e4f5200 2025-05-07T20:23:33.2490372Z instance-type: g5.4xlarge 2025-05-07T20:23:33.2531724Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:33.2532071Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:33.2541813Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:33.2542163Z env: 2025-05-07T20:23:33.2542385Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:33.2542691Z BUILD_ENV: build_binary 2025-05-07T20:23:33.2542934Z BUILD_TARGET: genai 2025-05-07T20:23:33.2543158Z BUILD_VARIANT: cuda 2025-05-07T20:23:33.2543395Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:33.2543658Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:33.2543965Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:33.2544292Z ##[endgroup] 2025-05-07T20:23:33.5876323Z ################################################################################ 2025-05-07T20:23:33.5876792Z [INFO] Printing general display info ... 2025-05-07T20:23:33.5893780Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:33.6815034Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:33.6824361Z /usr/bin/sudo 2025-05-07T20:23:33.6835077Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:33.6845078Z /usr/bin/yum 2025-05-07T20:23:33.6846796Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:33.6867213Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:34.1718219Z Last metadata expiration check: 0:00:05 ago on Wed May 7 20:23:29 2025. 2025-05-07T20:23:34.2489585Z ================================================================================ 2025-05-07T20:23:34.2490048Z WARNING: 2025-05-07T20:23:34.2490379Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:34.2490690Z 2025-05-07T20:23:34.2490848Z Available Versions: 2025-05-07T20:23:34.2491059Z 2025-05-07T20:23:34.2491184Z Version 2023.7.20250331: 2025-05-07T20:23:34.2491540Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:34.2491793Z 2025-05-07T20:23:34.2491941Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:34.2492148Z 2025-05-07T20:23:34.2492235Z Release notes: 2025-05-07T20:23:34.2492644Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:34.2493006Z 2025-05-07T20:23:34.2493104Z Version 2023.7.20250414: 2025-05-07T20:23:34.2493414Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:34.2493656Z 2025-05-07T20:23:34.2493771Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:34.2493985Z 2025-05-07T20:23:34.2494070Z Release notes: 2025-05-07T20:23:34.2494458Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:34.2494819Z 2025-05-07T20:23:34.2494914Z Version 2023.7.20250428: 2025-05-07T20:23:34.2495215Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:34.2495461Z 2025-05-07T20:23:34.2495575Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:34.2495782Z 2025-05-07T20:23:34.2495875Z Release notes: 2025-05-07T20:23:34.2496254Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:34.2496613Z 2025-05-07T20:23:34.2496735Z ================================================================================ 2025-05-07T20:23:34.3651653Z Dependencies resolved. 
2025-05-07T20:23:34.3938650Z ================================================================================ 2025-05-07T20:23:34.3939261Z Package Arch Version Repository Size 2025-05-07T20:23:34.3939767Z ================================================================================ 2025-05-07T20:23:34.3940117Z Upgrading: 2025-05-07T20:23:34.3940697Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:34.3941273Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:34.3941705Z 2025-05-07T20:23:34.3942003Z Transaction Summary 2025-05-07T20:23:34.3942264Z ================================================================================ 2025-05-07T20:23:34.3942571Z Upgrade 2 Packages 2025-05-07T20:23:34.3942772Z 2025-05-07T20:23:34.3942882Z Total download size: 6.9 M 2025-05-07T20:23:34.3943141Z Downloading Packages: 2025-05-07T20:23:34.4403823Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 27 MB/s | 1.2 MB 00:00 2025-05-07T20:23:34.4813341Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 66 MB/s | 5.7 MB 00:00 2025-05-07T20:23:34.4821572Z -------------------------------------------------------------------------------- 2025-05-07T20:23:34.4824547Z Total 79 MB/s | 6.9 MB 00:00 2025-05-07T20:23:34.4826940Z Running transaction check 2025-05-07T20:23:34.4924415Z Transaction check succeeded. 2025-05-07T20:23:34.4924964Z Running transaction test 2025-05-07T20:23:34.5221278Z Transaction test succeeded. 2025-05-07T20:23:34.5223798Z Running transaction 2025-05-07T20:23:35.0803341Z Preparing : 1/1 2025-05-07T20:23:35.1882793Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:35.1917654Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:35.2139535Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:35.2140176Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:35.2250279Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:35.2278976Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:35.3753266Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:35.3753844Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:35.3754378Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:35.3754914Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:35.5158431Z ================================================================================ 2025-05-07T20:23:35.5159074Z WARNING: 2025-05-07T20:23:35.5159431Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:35.5159738Z 2025-05-07T20:23:35.5159868Z Available Versions: 2025-05-07T20:23:35.5160065Z 2025-05-07T20:23:35.5160185Z Version 2023.7.20250331: 2025-05-07T20:23:35.5160561Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:35.5160809Z 2025-05-07T20:23:35.5160939Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:35.5161167Z 2025-05-07T20:23:35.5161251Z Release notes: 2025-05-07T20:23:35.5161658Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:35.5162022Z 2025-05-07T20:23:35.5162126Z Version 2023.7.20250414: 2025-05-07T20:23:35.5162426Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:35.5162671Z 2025-05-07T20:23:35.5162785Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:35.5162996Z 2025-05-07T20:23:35.5163093Z Release notes: 2025-05-07T20:23:35.5163479Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:35.5163834Z 2025-05-07T20:23:35.5163931Z Version 2023.7.20250428: 2025-05-07T20:23:35.5164225Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:35.5164472Z 2025-05-07T20:23:35.5164585Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:35.5164788Z 2025-05-07T20:23:35.5165152Z Release notes: 2025-05-07T20:23:35.5165533Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:35.5165890Z 2025-05-07T20:23:35.5166206Z ================================================================================ 2025-05-07T20:23:35.5723427Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:35.5724342Z 2025-05-07T20:23:35.5724582Z Upgraded: 2025-05-07T20:23:35.5725501Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:35.5726619Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:35.5727294Z 2025-05-07T20:23:35.5727455Z Complete! 2025-05-07T20:23:35.6182671Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:35.6208867Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:36.0751014Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:29 2025. 2025-05-07T20:23:36.0990760Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:36.0995718Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:36.1397182Z Dependencies resolved. 2025-05-07T20:23:36.1579994Z Nothing to do. 2025-05-07T20:23:36.1580527Z Complete! 2025-05-07T20:23:36.1982437Z + hostname 2025-05-07T20:23:36.1982589Z 2025-05-07T20:23:36.1996778Z ip-10-0-66-0.ec2.internal 2025-05-07T20:23:36.1998162Z 2025-05-07T20:23:36.1998652Z + sudo lshw -C display 2025-05-07T20:23:36.1998877Z 2025-05-07T20:23:36.4794934Z *-display:0 UNCLAIMED 2025-05-07T20:23:36.4795381Z description: VGA compatible controller 2025-05-07T20:23:36.4795704Z product: Amazon.com, Inc. 2025-05-07T20:23:36.4795973Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:36.4796231Z physical id: 3 2025-05-07T20:23:36.4796468Z bus info: pci@0000:00:03.0 2025-05-07T20:23:36.4796717Z version: 00 2025-05-07T20:23:36.4796956Z width: 32 bits 2025-05-07T20:23:36.4797173Z clock: 33MHz 2025-05-07T20:23:36.4797415Z capabilities: vga_controller bus_master 2025-05-07T20:23:36.4797727Z configuration: latency=0 2025-05-07T20:23:36.4798061Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:36.4798387Z *-display:1 2025-05-07T20:23:36.4798612Z description: 3D controller 2025-05-07T20:23:36.4798889Z product: GA102GL [A10G] 2025-05-07T20:23:36.4799179Z vendor: NVIDIA Corporation 2025-05-07T20:23:36.4799466Z physical id: 1e 2025-05-07T20:23:36.4799702Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:36.4799957Z version: a1 2025-05-07T20:23:36.4800163Z width: 64 bits 2025-05-07T20:23:36.4800382Z clock: 33MHz 2025-05-07T20:23:36.4800676Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:36.4801038Z configuration: driver=nvidia latency=0 2025-05-07T20:23:36.4801650Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:36.4834587Z 2025-05-07T20:23:36.4834790Z ################################################################################ 2025-05-07T20:23:36.4835265Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:36.4962107Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:36.5149266Z Wed May 7 20:23:36 2025 2025-05-07T20:23:36.5149765Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.5150291Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:36.5150765Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:36.5151252Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:36.5151772Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:36.5152409Z | | | MIG M. | 2025-05-07T20:23:36.5152904Z |=========================================+========================+======================| 2025-05-07T20:23:36.5283068Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:36.5283583Z | 0% 29C P8 22W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:36.5283986Z | | | N/A | 2025-05-07T20:23:36.5284374Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:36.5287934Z 2025-05-07T20:23:36.5288425Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.5288860Z | Processes: | 2025-05-07T20:23:36.5289282Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:36.5289689Z | ID ID Usage | 2025-05-07T20:23:36.5290031Z |=========================================================================================| 2025-05-07T20:23:36.5293253Z | No running processes found | 2025-05-07T20:23:36.5293720Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:36.7749409Z ################################################################################ 2025-05-07T20:23:36.7749745Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:36.7894020Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:36.7894835Z [CHECK] rocminfo not found 2025-05-07T20:23:36.7903722Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:36.7904855Z [CHECK] rocm-smi not found 2025-05-07T20:23:36.7948737Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:36.7949574Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:36.7963947Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:36.7964516Z env: 2025-05-07T20:23:36.7964872Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:36.7965360Z BUILD_ENV: build_binary 2025-05-07T20:23:36.7965741Z BUILD_TARGET: genai 2025-05-07T20:23:36.7966084Z BUILD_VARIANT: cuda 2025-05-07T20:23:36.7966456Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:36.7966859Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:36.7967319Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:36.7967822Z ##[endgroup] 2025-05-07T20:23:37.1333317Z ################################################################################ 2025-05-07T20:23:37.1333723Z # Setup Miniconda 2025-05-07T20:23:37.1333940Z # 2025-05-07T20:23:37.1349706Z # [2025-05-07T20:23:37.134Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:37.1350114Z ################################################################################ 2025-05-07T20:23:37.1350327Z 2025-05-07T20:23:37.1366161Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:37.2248902Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:37.2249421Z [SETUP] A Miniconda installation appears to already exist in /home/ec2-user/miniconda ... 2025-05-07T20:23:37.2249973Z [SETUP] Clearing out directory: /home/ec2-user/miniconda ... 2025-05-07T20:23:37.2250341Z + rm -rf /home/ec2-user/miniconda 2025-05-07T20:23:37.2250532Z 2025-05-07T20:23:42.1495953Z 2025-05-07T20:23:42.1496647Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:42.1497101Z 2025-05-07T20:23:42.1513992Z 2025-05-07T20:23:42.1514443Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:42.1538075Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:43.1491822Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:43.1492212Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:43.1492466Z 2025-05-07T20:23:43.1639466Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:43.6175587Z Unpacking payload ... 2025-05-07T20:23:44.1390754Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:44.9454472Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:47.0588735Z 2025-05-07T20:23:47.0589477Z Installing base environment... 2025-05-07T20:23:47.0589808Z 2025-05-07T20:23:48.1420674Z Preparing transaction: ...working... done 2025-05-07T20:23:51.1484705Z Executing transaction: ...working... done 2025-05-07T20:23:51.8076875Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 
2025-05-07T20:23:51.8973224Z installation finished. 2025-05-07T20:23:51.8980143Z 2025-05-07T20:23:51.8980362Z + rm -f miniconda.sh 2025-05-07T20:23:51.8980512Z 2025-05-07T20:23:51.9303921Z 2025-05-07T20:23:51.9304114Z [SETUP] Reloading the bash configuration ... 2025-05-07T20:23:51.9304462Z + /home/ec2-user/miniconda/bin/conda init bash 2025-05-07T20:23:51.9304684Z 2025-05-07T20:23:52.2964228Z no change /home/ec2-user/miniconda/condabin/conda 2025-05-07T20:23:52.2964649Z no change /home/ec2-user/miniconda/bin/conda 2025-05-07T20:23:52.2965099Z no change /home/ec2-user/miniconda/bin/conda-env 2025-05-07T20:23:52.2965546Z no change /home/ec2-user/miniconda/bin/activate 2025-05-07T20:23:52.2966007Z no change /home/ec2-user/miniconda/bin/deactivate 2025-05-07T20:23:52.2966874Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh 2025-05-07T20:23:52.2967370Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish 2025-05-07T20:23:52.2967801Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1 2025-05-07T20:23:52.2968243Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1 2025-05-07T20:23:52.2968764Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh 2025-05-07T20:23:52.2969273Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh 2025-05-07T20:23:52.2969624Z no change /home/ec2-user/.bashrc 2025-05-07T20:23:52.2969896Z No action taken. 2025-05-07T20:23:52.3624290Z 2025-05-07T20:23:52.3624865Z + . /home/ec2-user/.bashrc 2025-05-07T20:23:52.3625121Z 2025-05-07T20:23:53.2032722Z 2025-05-07T20:23:53.2033266Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ... 2025-05-07T20:23:53.2056163Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive 2025-05-07T20:24:06.5918101Z Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:24:08.1736940Z Solving environment: | / - \ | / - \ | / - \ done 2025-05-07T20:24:08.2708058Z 2025-05-07T20:24:08.2708397Z ## Package Plan ## 2025-05-07T20:24:08.2708551Z 2025-05-07T20:24:08.2708712Z environment location: /home/ec2-user/miniconda 2025-05-07T20:24:08.2709289Z 2025-05-07T20:24:08.2709389Z added / updated specs: 2025-05-07T20:24:08.2709660Z - conda-libmamba-solver 2025-05-07T20:24:08.2709923Z - libarchive 2025-05-07T20:24:08.2710140Z - libmamba 2025-05-07T20:24:08.2710355Z - libmambapy 2025-05-07T20:24:08.2710481Z 2025-05-07T20:24:08.2710485Z 2025-05-07T20:24:08.2710613Z The following packages will be downloaded: 2025-05-07T20:24:08.2710835Z 2025-05-07T20:24:08.2710955Z package | build 2025-05-07T20:24:08.2711273Z ---------------------------|----------------- 2025-05-07T20:24:08.2711686Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge 2025-05-07T20:24:08.2712177Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge 2025-05-07T20:24:08.2712603Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge 2025-05-07T20:24:08.2713073Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge 2025-05-07T20:24:08.2713531Z ------------------------------------------------------------ 2025-05-07T20:24:08.2713876Z Total: 1.4 MB 2025-05-07T20:24:08.2714089Z 2025-05-07T20:24:08.2714207Z The following packages will be UPDATED: 
2025-05-07T20:24:08.2717944Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:08.2718719Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:08.2719312Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:08.2719934Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:08.2720722Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:08.2721580Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:08.3304865Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:08.3410111Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:08.3731218Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:08.4725309Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:08.4727474Z done
2025-05-07T20:24:08.5729227Z Preparing transaction: done
2025-05-07T20:24:08.6736959Z Verifying transaction: done
2025-05-07T20:24:10.0762183Z Executing transaction: done
2025-05-07T20:24:11.9958264Z [SETUP] Updating Miniconda base packages ...
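Note: the Miniconda bootstrap earlier in this step ran the installer in batch mode so no prompts block the job. A condensed sketch of the non-interactive pattern (installer URL and flags as seen in the log):

    # Unattended Miniconda install: -b = batch mode (no prompts),
    # -p = target prefix, -u = update an existing install in place.
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash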
2025-05-07T20:24:11.9986469Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:12.8264824Z Channels:
2025-05-07T20:24:12.8265092Z - defaults
2025-05-07T20:24:12.8265345Z Platform: linux-64
2025-05-07T20:24:14.0535264Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.1757511Z Solving environment: done
2025-05-07T20:24:14.4596191Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:14.6701934Z Solving environment: done
2025-05-07T20:24:14.8215261Z ## Package Plan ##
2025-05-07T20:24:14.8215595Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:14.8215969Z added / updated specs:
2025-05-07T20:24:14.8216242Z - conda
2025-05-07T20:24:14.8216506Z The following packages will be downloaded:
2025-05-07T20:24:14.8216865Z package | build
2025-05-07T20:24:14.8217208Z ---------------------------|-----------------
2025-05-07T20:24:14.8217848Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:14.8218233Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:14.8218596Z ------------------------------------------------------------
2025-05-07T20:24:14.8218930Z Total: 1.4 MB
2025-05-07T20:24:14.8219258Z The following packages will be UPDATED:
2025-05-07T20:24:14.8219765Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:14.8220259Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:14.8220661Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:14.8767130Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:15.0882256Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:15.0889466Z done
2025-05-07T20:24:15.1892653Z Preparing transaction: done
2025-05-07T20:24:15.2898811Z Verifying transaction: done
2025-05-07T20:24:17.2926635Z Executing transaction: done
2025-05-07T20:24:17.9006963Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:17.9010644Z + conda clean --packages --tarball -y
2025-05-07T20:24:19.1374519Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:19.1374897Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:19.2051269Z + conda clean --all -y
2025-05-07T20:24:19.7503421Z There are no unused tarball(s) to remove.
2025-05-07T20:24:19.7503806Z Will remove 1 index cache(s).
2025-05-07T20:24:19.7504093Z There are no unused package(s) to remove.
2025-05-07T20:24:19.7504402Z There are no tempfile(s) to remove.
2025-05-07T20:24:19.7504739Z There are no logfile(s) to remove.
2025-05-07T20:24:19.8148806Z + conda info
2025-05-07T20:24:20.5664550Z active environment : base
2025-05-07T20:24:20.5664939Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:20.5665294Z shell level : 1
2025-05-07T20:24:20.5665614Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:20.5665990Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:20.5666351Z conda version : 25.3.1
2025-05-07T20:24:20.5666631Z conda-build version : not installed
2025-05-07T20:24:20.5666924Z python version : 3.13.2.final.0
2025-05-07T20:24:20.5667223Z solver : libmamba (default)
2025-05-07T20:24:20.5667531Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:20.5667821Z __conda=25.3.1=0
2025-05-07T20:24:20.5668104Z __cuda=12.8=0
2025-05-07T20:24:20.5668377Z __glibc=2.34=0
2025-05-07T20:24:20.5668670Z __linux=6.1.130=0
2025-05-07T20:24:20.5668939Z __unix=0=0
2025-05-07T20:24:20.5669559Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:20.5669972Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:20.5670309Z conda av metadata url : None
2025-05-07T20:24:20.5670679Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:20.5671100Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:20.5671473Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:20.5671849Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:20.5672218Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:20.5672558Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:20.5672888Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:20.5673230Z /home/ec2-user/.conda/envs
2025-05-07T20:24:20.5673534Z platform : linux-64
2025-05-07T20:24:20.5674350Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:20.5675164Z UID:GID : 1000:1000
2025-05-07T20:24:20.5675440Z netrc file : None
2025-05-07T20:24:20.5675702Z offline mode : False
2025-05-07T20:24:20.6341432Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:20.6342832Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_7aad9153-10f6-47cc-a7e3-d15333c993e3 ...
2025-05-07T20:24:20.6343601Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
2025-05-07T20:24:20.6425434Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.11
2025-05-07T20:24:20.6426091Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.11
2025-05-07T20:24:20.6444600Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:20.6444955Z env:
2025-05-07T20:24:20.6445183Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:20.6445471Z BUILD_ENV: build_binary
2025-05-07T20:24:20.6445716Z BUILD_TARGET: genai
2025-05-07T20:24:20.6445941Z BUILD_VARIANT: cuda
2025-05-07T20:24:20.6446165Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:20.6446422Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:20.6446719Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:20.6447040Z ##[endgroup]
2025-05-07T20:24:20.9809273Z ################################################################################
2025-05-07T20:24:20.9809750Z # Create Conda Environment
2025-05-07T20:24:20.9810082Z #
2025-05-07T20:24:20.9826593Z # [2025-05-07T20:24:20.982Z] + create_conda_environment build_binary 3.11
2025-05-07T20:24:20.9827170Z ################################################################################
2025-05-07T20:24:20.9843975Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:21.0710095Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:21.0710619Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:21.0711029Z + conda info --envs
2025-05-07T20:24:21.8197414Z # conda environments:
2025-05-07T20:24:21.8197794Z #
2025-05-07T20:24:21.8198114Z base /home/ec2-user/miniconda
2025-05-07T20:24:21.8857037Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:23.5281865Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:23.5303642Z [SETUP] Creating new Conda environment (Python 3.11) ...
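Note: the environment listing and `rm -rf` above ensure repeated runs start from a clean slate before the `conda create` below. A condensed sketch of that idempotent pattern (using `conda info --base` in place of the hard-coded prefix is an assumption):

    # Recreate the build environment from scratch on every run.
    ENV_NAME=build_binary
    rm -rf "$(conda info --base)/envs/${ENV_NAME}"
    conda create -y -n "${ENV_NAME}" python=3.11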
2025-05-07T20:24:23.5326296Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.11 2025-05-07T20:24:24.2878200Z Channels: 2025-05-07T20:24:24.2878519Z - defaults 2025-05-07T20:24:24.2878799Z Platform: linux-64 2025-05-07T20:24:25.8394558Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ done 2025-05-07T20:24:25.9636956Z Solving environment: / done 2025-05-07T20:24:25.9924162Z 2025-05-07T20:24:25.9924552Z ## Package Plan ## 2025-05-07T20:24:25.9924766Z 2025-05-07T20:24:25.9925044Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:25.9925454Z 2025-05-07T20:24:25.9925577Z added / updated specs: 2025-05-07T20:24:25.9925821Z - python=3.11 2025-05-07T20:24:25.9925958Z 2025-05-07T20:24:25.9925963Z 2025-05-07T20:24:25.9926083Z The following packages will be downloaded: 2025-05-07T20:24:25.9926294Z 2025-05-07T20:24:25.9926445Z package | build 2025-05-07T20:24:25.9926756Z ---------------------------|----------------- 2025-05-07T20:24:25.9927119Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:25.9927547Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:25.9928078Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:25.9928836Z python-3.11.11 | he870216_0 32.9 MB 2025-05-07T20:24:25.9936927Z setuptools-78.1.1 | py311h06a4308_0 2.3 MB 2025-05-07T20:24:25.9937511Z wheel-0.45.1 | py311h06a4308_0 151 KB 2025-05-07T20:24:25.9938013Z ------------------------------------------------------------ 2025-05-07T20:24:25.9938372Z Total: 35.4 MB 2025-05-07T20:24:25.9938575Z 2025-05-07T20:24:25.9938714Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:25.9938933Z 2025-05-07T20:24:25.9939515Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:25.9940097Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:25.9940509Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:25.9940986Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:25.9941509Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:25.9941984Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:25.9942407Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:25.9942837Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:25.9943338Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:25.9943792Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:25.9944211Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:25.9944626Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:25.9945023Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:25.9945414Z python pkgs/main/linux-64::python-3.11.11-he870216_0 2025-05-07T20:24:25.9945838Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:25.9946304Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py311h06a4308_0 2025-05-07T20:24:25.9946754Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:25.9947160Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:25.9947561Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:25.9947971Z wheel pkgs/main/linux-64::wheel-0.45.1-py311h06a4308_0 2025-05-07T20:24:25.9948350Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:25.9948727Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T20:24:25.9948965Z 2025-05-07T20:24:25.9948970Z 
2025-05-07T20:24:25.9949200Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:26.0378120Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:26.0667540Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:26.1341414Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:26.1928331Z wheel-0.45.1 | 151 KB | ########## | 100%
2025-05-07T20:24:26.4562794Z python-3.11.11 | 32.9 MB | ########## | 100%
2025-05-07T20:24:27.0769147Z setuptools-78.1.1 | 2.3 MB | ########## | 100%
2025-05-07T20:24:27.0779497Z done
2025-05-07T20:24:27.2885915Z Preparing transaction: done
2025-05-07T20:24:28.6579323Z Verifying transaction: done
2025-05-07T20:24:30.9707068Z Executing transaction: done
2025-05-07T20:24:31.0237011Z #
2025-05-07T20:24:31.0237368Z # To activate this environment, use
2025-05-07T20:24:31.0237766Z #
2025-05-07T20:24:31.0238043Z #     $ conda activate build_binary
2025-05-07T20:24:31.0238365Z #
2025-05-07T20:24:31.0238585Z # To deactivate an active environment, use
2025-05-07T20:24:31.0238881Z #
2025-05-07T20:24:31.0239078Z #     $ conda deactivate
2025-05-07T20:24:31.1295132Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:31.1319339Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:33.8748566Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (25.1)
2025-05-07T20:24:33.8749359Z Collecting pip
2025-05-07T20:24:33.8749684Z Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:33.8750097Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:33.8750440Z Installing collected packages: pip
2025-05-07T20:24:33.8750743Z Attempting uninstall: pip
2025-05-07T20:24:33.8751033Z Found existing installation: pip 25.1
2025-05-07T20:24:33.8751338Z Uninstalling pip-25.1:
2025-05-07T20:24:33.8751618Z Successfully uninstalled pip-25.1
2025-05-07T20:24:33.8751934Z Successfully installed pip-25.1.1
2025-05-07T20:24:33.9380327Z [SETUP] Upgrading pyOpenSSL ...
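Note: the version spec passed to the next command contains `>`. The retry wrapper hands it to conda as a single argument; when typing the equivalent command into a shell, the spec must be quoted so `>` is not parsed as output redirection. A sketch:

    # Quote the spec so '>' is not treated as shell redirection.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"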
2025-05-07T20:24:33.9403036Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:34.7945665Z Channels:
2025-05-07T20:24:34.7946022Z - conda-forge
2025-05-07T20:24:34.7946345Z Platform: linux-64
2025-05-07T20:24:45.3690724Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:47.0903237Z Solving environment: done
2025-05-07T20:24:47.1521621Z ## Package Plan ##
2025-05-07T20:24:47.1522162Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:47.1522670Z added / updated specs:
2025-05-07T20:24:47.1522945Z - pyopenssl[version='>22.1.0']
2025-05-07T20:24:47.1523261Z The following packages will be downloaded:
2025-05-07T20:24:47.1523599Z package | build
2025-05-07T20:24:47.1523928Z ---------------------------|-----------------
2025-05-07T20:24:47.1524321Z cffi-1.17.1 | py311hf29c0ef_0 295 KB conda-forge
2025-05-07T20:24:47.1524770Z cryptography-44.0.3 | py311hafd3f86_0 1.5 MB conda-forge
2025-05-07T20:24:47.1525324Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:24:47.1525739Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:24:47.1526147Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:24:47.1526565Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:24:47.1526981Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:24:47.1527413Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:24:47.1527833Z python_abi-3.11 | 2_cp311 5 KB conda-forge
2025-05-07T20:24:47.1528522Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:24:47.1529014Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:24:47.1529440Z ------------------------------------------------------------
2025-05-07T20:24:47.1529794Z Total: 6.4 MB
2025-05-07T20:24:47.1530220Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:47.1530675Z cffi conda-forge/linux-64::cffi-1.17.1-py311hf29c0ef_0
2025-05-07T20:24:47.1531171Z cryptography conda-forge/linux-64::cryptography-44.0.3-py311hafd3f86_0
2025-05-07T20:24:47.1531660Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:47.1532102Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:47.1532575Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:47.1533030Z python_abi conda-forge/linux-64::python_abi-3.11-2_cp311
2025-05-07T20:24:47.1534076Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:47.1535228Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:47.1535681Z The following packages will be UPDATED:
2025-05-07T20:24:47.1536277Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:47.1537021Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:47.1537659Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:47.1538283Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:47.1538802Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:47.3747660Z cffi-1.17.1 | 295 KB | ########## | 100%
2025-05-07T20:24:47.3894472Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:47.4039949Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:47.4070801Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:47.4075515Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:47.4364410Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:47.4491524Z python_abi-3.11 | 5 KB | ########## | 100%
2025-05-07T20:24:47.4993015Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:47.5898101Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:47.6043837Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:47.6051990Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:47.6057395Z done
2025-05-07T20:24:47.7066738Z Preparing transaction: done
2025-05-07T20:24:47.8071609Z Verifying transaction: done
2025-05-07T20:24:49.3098805Z Executing transaction: done
2025-05-07T20:24:49.4862905Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:51.2209605Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:51.2224261Z [SETUP] Installing libxcrypt ...
2025-05-07T20:24:51.2248616Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:52.0909717Z Channels:
2025-05-07T20:24:52.0909962Z - conda-forge
2025-05-07T20:24:52.0910196Z Platform: linux-64
2025-05-07T20:24:55.3826716Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:55.7503030Z Solving environment: done
2025-05-07T20:24:55.8131741Z ## Package Plan ##
2025-05-07T20:24:55.8132746Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:55.8133769Z added / updated specs:
2025-05-07T20:24:55.8134313Z - libxcrypt
2025-05-07T20:24:55.8134846Z The following packages will be downloaded:
2025-05-07T20:24:55.8135495Z package | build
2025-05-07T20:24:55.8136126Z ---------------------------|-----------------
2025-05-07T20:24:55.8136867Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:24:55.8137663Z ------------------------------------------------------------
2025-05-07T20:24:55.8138148Z Total: 98 KB
2025-05-07T20:24:55.8138483Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:55.8138925Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:55.8139363Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:55.9899940Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:24:55.9903124Z done
2025-05-07T20:24:56.0905932Z Preparing transaction: done
2025-05-07T20:24:56.1910463Z Verifying transaction: done
2025-05-07T20:24:56.2914894Z Executing transaction: done
2025-05-07T20:24:59.7360710Z [SETUP] Copying over ...
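Note: libxcrypt is pulled in because newer distributions no longer ship crypt.h with the base toolchain, while some Python 3.11 extension builds still reference it; the copy that follows drops the conda-forge header into the environment's Python include directory. A sketch of the workaround (paths as in the command below):

    # Make crypt.h visible to Python extension builds in this environment.
    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.11/crypt.h"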
2025-05-07T20:24:59.7361416Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.11/crypt.h 2025-05-07T20:24:59.7361956Z 2025-05-07T20:24:59.7390907Z 2025-05-07T20:25:01.3857762Z [SETUP] Installed Python version: Python 3.11.11 2025-05-07T20:25:01.3858429Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:01.3892116Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.3892568Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.3904560Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:01.3904895Z env: 2025-05-07T20:25:01.3905126Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:01.3905431Z BUILD_ENV: build_binary 2025-05-07T20:25:01.3905672Z BUILD_TARGET: genai 2025-05-07T20:25:01.3905891Z BUILD_VARIANT: cuda 2025-05-07T20:25:01.3906123Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:01.3906372Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:01.3906662Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:01.3906989Z ##[endgroup] 2025-05-07T20:25:01.7366976Z ################################################################################ 2025-05-07T20:25:01.7367357Z # Install C/C++ Compilers 2025-05-07T20:25:01.7367597Z # 2025-05-07T20:25:01.7383879Z # [2025-05-07T20:25:01.738Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:25:01.7384587Z ################################################################################ 2025-05-07T20:25:01.7384805Z 2025-05-07T20:25:01.7404103Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:01.8322874Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:01.8334137Z [INSTALL] Installing GLIBC (architecture = 64) ... 
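Note: pinning sysroot_linux-64=2.17 caps the glibc version the toolchain links against, keeping the resulting binaries loadable on older distributions (glibc 2.17 is the manylinux2014 baseline). The install below is the single command involved; a sketch with a hypothetical post-build check added as a comment:

    # Pin the build sysroot to glibc 2.17 for broad runtime compatibility.
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
    # Hypothetical check on a built artifact: the highest GLIBC_* symbol
    # version referenced should not exceed 2.17.
    # objdump -T some_built_lib.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail -1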
2025-05-07T20:25:01.8358026Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:02.7010815Z Channels:
2025-05-07T20:25:02.7011456Z - conda-forge
2025-05-07T20:25:02.7012068Z Platform: linux-64
2025-05-07T20:25:06.0644880Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:06.4346228Z Solving environment: done
2025-05-07T20:25:06.4967939Z ## Package Plan ##
2025-05-07T20:25:06.4968312Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:06.4968730Z added / updated specs:
2025-05-07T20:25:06.4969005Z - sysroot_linux-64=2.17
2025-05-07T20:25:06.4969321Z The following packages will be downloaded:
2025-05-07T20:25:06.4969659Z package | build
2025-05-07T20:25:06.4969981Z ---------------------------|-----------------
2025-05-07T20:25:06.4970392Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:06.4970879Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:06.4971299Z ------------------------------------------------------------
2025-05-07T20:25:06.4971649Z Total: 15.4 MB
2025-05-07T20:25:06.4972035Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:06.4972552Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:06.4973120Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:06.4973584Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:06.8238944Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:07.0186635Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:07.4459349Z done
2025-05-07T20:25:07.5462746Z Preparing transaction: done
2025-05-07T20:25:07.7466951Z Verifying transaction: done
2025-05-07T20:25:07.9522992Z Executing transaction: done
2025-05-07T20:25:08.1087897Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:08.1088199Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:09.7937821Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:09.7951823Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
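Note: the conda-forge compiler packages (gcc_linux-64 / gxx_linux-64) ship activation scripts that export CC, CXX, and related flags inside the environment; this is an assumption about those packages' behavior, not something this log prints. A quick hypothetical check once the install finishes:

    # Activation scripts from the compiler packages export CC/CXX.
    conda run -n build_binary bash -c 'echo "CC=${CC} CXX=${CXX}"; "${CXX}" --version | head -1'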
2025-05-07T20:25:09.7973193Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:10.6862006Z Channels:
2025-05-07T20:25:10.6862266Z - conda-forge
2025-05-07T20:25:10.6862500Z Platform: linux-64
2025-05-07T20:25:14.0093710Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.9637028Z Solving environment: done
2025-05-07T20:25:15.0291140Z ## Package Plan ##
2025-05-07T20:25:15.0291939Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.0292720Z added / updated specs:
2025-05-07T20:25:15.0293233Z - gxx_linux-64=11.4.0
2025-05-07T20:25:15.0293810Z The following packages will be downloaded:
2025-05-07T20:25:15.0294529Z package | build
2025-05-07T20:25:15.0295067Z ---------------------------|-----------------
2025-05-07T20:25:15.0295473Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:15.0295953Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:15.0296408Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:15.0296847Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:15.0297287Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:15.0297724Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:15.0298146Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:15.0298615Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:15.0299087Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:15.0299521Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:15.0299982Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:15.0300455Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:15.0300861Z ------------------------------------------------------------
2025-05-07T20:25:15.0301276Z Total: 91.6 MB
2025-05-07T20:25:15.0301697Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:15.0302251Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:15.0302819Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:15.0303786Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:15.0304303Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:15.0304830Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:15.0305357Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:15.0305877Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:15.0306432Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:15.0306925Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:15.0307462Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:15.0307934Z The following packages will be UPDATED:
2025-05-07T20:25:15.0308460Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:15.0309421Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:15.0309987Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:15.4229337Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:15.7079841Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:15.7894104Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:15.9086826Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:15.9091381Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:15.9186683Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:15.9640115Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:15.9697262Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:15.9714760Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:15.9816790Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:16.0092774Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:16.4983139Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90%
2025-05-07T20:25:16.5497576Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:16.5497960Z 2025-05-07T20:25:16.5497964Z 2025-05-07T20:25:16.5497968Z 2025-05-07T20:25:16.5497971Z 2025-05-07T20:25:16.5497975Z 2025-05-07T20:25:16.5497978Z 2025-05-07T20:25:16.5497982Z 2025-05-07T20:25:16.5497986Z 2025-05-07T20:25:16.5497989Z 2025-05-07T20:25:16.5497993Z 2025-05-07T20:25:16.5498171Z 2025-05-07T20:25:16.6057924Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:16.6058226Z 2025-05-07T20:25:16.6058230Z 2025-05-07T20:25:16.6058233Z 2025-05-07T20:25:16.6058237Z 2025-05-07T20:25:16.6058241Z 2025-05-07T20:25:16.6058244Z 2025-05-07T20:25:16.6058248Z 2025-05-07T20:25:16.6058251Z 2025-05-07T20:25:16.6058255Z 2025-05-07T20:25:16.6058259Z 2025-05-07T20:25:16.6063060Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:16.6063568Z 2025-05-07T20:25:16.6063572Z 2025-05-07T20:25:16.6063576Z 2025-05-07T20:25:16.6063579Z 2025-05-07T20:25:16.6063583Z 2025-05-07T20:25:16.6063586Z 2025-05-07T20:25:16.6063590Z 2025-05-07T20:25:16.6063599Z 2025-05-07T20:25:16.6063603Z 2025-05-07T20:25:16.6064727Z 2025-05-07T20:25:16.6446329Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:16.6446602Z 2025-05-07T20:25:16.6446613Z 2025-05-07T20:25:16.6446617Z 2025-05-07T20:25:16.7139668Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:16.7140066Z 2025-05-07T20:25:16.7140072Z 2025-05-07T20:25:16.7140087Z 2025-05-07T20:25:16.7140093Z 2025-05-07T20:25:16.7140098Z 2025-05-07T20:25:16.7140104Z 2025-05-07T20:25:16.7140108Z 2025-05-07T20:25:16.7140113Z 2025-05-07T20:25:16.7140118Z 2025-05-07T20:25:16.7145136Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:25:16.7145522Z 2025-05-07T20:25:16.7145550Z 2025-05-07T20:25:16.7145554Z 2025-05-07T20:25:16.7145558Z 2025-05-07T20:25:16.7145561Z 2025-05-07T20:25:16.7145565Z 2025-05-07T20:25:16.7145568Z 2025-05-07T20:25:16.7145572Z 2025-05-07T20:25:16.7145575Z 2025-05-07T20:25:16.8005320Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%  2025-05-07T20:25:16.8005911Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:16.9651531Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:16.9651867Z 2025-05-07T20:25:17.1508608Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:17.1508887Z 2025-05-07T20:25:17.1508891Z 2025-05-07T20:25:17.6033432Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:17.6040437Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.6041000Z 2025-05-07T20:25:17.6041338Z 2025-05-07T20:25:17.6041675Z  2025-05-07T20:25:17.6042028Z 2025-05-07T20:25:17.6042034Z 2025-05-07T20:25:17.6042328Z  2025-05-07T20:25:17.6042666Z 2025-05-07T20:25:17.6042672Z 2025-05-07T20:25:17.6042678Z 2025-05-07T20:25:17.6042955Z  2025-05-07T20:25:17.6043278Z 2025-05-07T20:25:17.6043283Z 2025-05-07T20:25:17.6043288Z 2025-05-07T20:25:17.6043293Z 2025-05-07T20:25:17.6043557Z  2025-05-07T20:25:17.6043868Z 2025-05-07T20:25:17.6043874Z 2025-05-07T20:25:17.6043879Z 2025-05-07T20:25:17.6043884Z 2025-05-07T20:25:17.6043889Z 2025-05-07T20:25:17.6044172Z  2025-05-07T20:25:17.6044486Z 2025-05-07T20:25:17.6044491Z 2025-05-07T20:25:17.6044496Z 2025-05-07T20:25:17.6044500Z 2025-05-07T20:25:17.6044505Z 2025-05-07T20:25:17.6044510Z 2025-05-07T20:25:17.6045063Z  2025-05-07T20:25:17.6045345Z 2025-05-07T20:25:17.6045348Z 2025-05-07T20:25:17.6045352Z 2025-05-07T20:25:17.6045356Z 2025-05-07T20:25:17.6045359Z 2025-05-07T20:25:17.6045363Z 
2025-05-07T20:25:17.6045366Z 2025-05-07T20:25:17.6045568Z  2025-05-07T20:25:17.6045864Z 2025-05-07T20:25:17.6045869Z 2025-05-07T20:25:17.6045874Z 2025-05-07T20:25:17.6045879Z 2025-05-07T20:25:17.6045884Z 2025-05-07T20:25:17.6045889Z 2025-05-07T20:25:17.6045894Z 2025-05-07T20:25:17.6045899Z 2025-05-07T20:25:17.6046172Z  2025-05-07T20:25:17.6046524Z 2025-05-07T20:25:17.6046530Z 2025-05-07T20:25:17.6046537Z 2025-05-07T20:25:17.6046543Z 2025-05-07T20:25:17.6046550Z 2025-05-07T20:25:17.6046556Z 2025-05-07T20:25:17.6046562Z 2025-05-07T20:25:17.6046568Z 2025-05-07T20:25:17.6046575Z 2025-05-07T20:25:17.6047105Z  2025-05-07T20:25:17.6047437Z 2025-05-07T20:25:17.6047443Z 2025-05-07T20:25:17.6047448Z 2025-05-07T20:25:17.6047453Z 2025-05-07T20:25:17.6047459Z 2025-05-07T20:25:17.6047464Z 2025-05-07T20:25:17.6047469Z 2025-05-07T20:25:17.6047474Z 2025-05-07T20:25:17.6047490Z 2025-05-07T20:25:17.6047495Z 2025-05-07T20:25:17.6047792Z  2025-05-07T20:25:17.6048137Z 2025-05-07T20:25:17.6048142Z 2025-05-07T20:25:17.6048147Z 2025-05-07T20:25:17.6048152Z 2025-05-07T20:25:17.6048157Z 2025-05-07T20:25:17.6048175Z 2025-05-07T20:25:17.6048180Z 2025-05-07T20:25:17.6048185Z 2025-05-07T20:25:17.6048190Z 2025-05-07T20:25:17.6048196Z 2025-05-07T20:25:17.6048201Z 2025-05-07T20:25:17.6048511Z  done 2025-05-07T20:25:17.7052655Z Preparing transaction: \ done 2025-05-07T20:25:18.0058068Z Verifying transaction: / - \ done 2025-05-07T20:25:18.1067997Z Executing transaction: / done 2025-05-07T20:25:18.2720261Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:22.1714407Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:22.1714955Z 2025-05-07T20:25:22.1726014Z 2025-05-07T20:25:22.1745659Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:22.1746200Z 2025-05-07T20:25:22.1758725Z 2025-05-07T20:25:22.1775967Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:22.1776489Z 2025-05-07T20:25:22.1788032Z 2025-05-07T20:25:22.1805286Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:22.1805854Z 2025-05-07T20:25:22.1818137Z 2025-05-07T20:25:24.0769534Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:24.0769815Z 2025-05-07T20:25:24.1388150Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:26.0271372Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:26.0271669Z 2025-05-07T20:25:26.0897040Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:27.9790795Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:27.9791076Z 2025-05-07T20:25:28.0420874Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:29.9289175Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:29.9289475Z 2025-05-07T20:25:29.9913345Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:29.9917541Z [INFO] Printing out all preprocessor defines in the C compiler ... 
2025-05-07T20:25:29.9917977Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:29.9918185Z 2025-05-07T20:25:31.8880449Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8881249Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:31.8881605Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:31.8881866Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8882200Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:31.8882553Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:31.8882838Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:31.8883142Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:31.8883405Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:31.8883662Z #define __CHAR_BIT__ 8 2025-05-07T20:25:31.8883898Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:31.8884149Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:31.8884407Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:31.8884676Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:31.8884957Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:31.8885260Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8885562Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:31.8886034Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:31.8886364Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:31.8886685Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:31.8887084Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:31.8887533Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:31.8887845Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:31.8888123Z #define __GCC_IEC_559 2 2025-05-07T20:25:31.8888374Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8888647Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:31.8888907Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8889190Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:31.8889519Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8889835Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:31.8890112Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8890389Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:31.8890661Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:31.8890932Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:31.8891196Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:31.8891490Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:31.8891765Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:31.8892022Z #define __INT8_C(c) c 2025-05-07T20:25:31.8892266Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:31.8892557Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8892880Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:31.8893194Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:31.8893541Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8893819Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:31.8894089Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8894363Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:31.8894643Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:31.8895036Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:31.8895450Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:31.8895734Z #define __linux 1 2025-05-07T20:25:31.8895966Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:31.8896248Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:31.8896526Z #define __unix 1 2025-05-07T20:25:31.8896753Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:31.8897033Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8897299Z #define __WINT_MIN__ 0U 2025-05-07T20:25:31.8897550Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8897834Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:31.8898101Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:31.8898369Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:31.8898623Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:31.8898901Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:31.8899196Z #define __INT64_C(c) c ## L 2025-05-07T20:25:31.8899465Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:31.8899871Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:31.8900133Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:31.8900479Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:31.8900851Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:31.8901099Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:31.8901385Z #define __DBL_DIG__ 15 2025-05-07T20:25:31.8901647Z #define __FLT32_DIG__ 6 2025-05-07T20:25:31.8901940Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:31.8902288Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:31.8902540Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:31.8902860Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:31.8903206Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:31.8903458Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:31.8903716Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:31.8904095Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:31.8904573Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:31.8904853Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:31.8905103Z #define __unix__ 1 2025-05-07T20:25:31.8905331Z #define __INT_WIDTH__ 32 2025-05-07T20:25:31.8905576Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:31.8905816Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:31.8906072Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:31.8906340Z #define __UINT16_C(c) c 2025-05-07T20:25:31.8906576Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:31.8906838Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:31.8907193Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:31.8907548Z #define __gnu_linux__ 1 2025-05-07T20:25:31.8907794Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:31.8908073Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8908356Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8908620Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:31.8908897Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:31.8909148Z #define __GNUC__ 11 2025-05-07T20:25:31.8909441Z #define __pie__ 2 2025-05-07T20:25:31.8909659Z #define __MMX__ 1 2025-05-07T20:25:31.8909878Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:31.8910147Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:31.8910431Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:31.8910698Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:31.8911042Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:31.8911484Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8911800Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:31.8912056Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:31.8912322Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:31.8912624Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:31.8912885Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:31.8913150Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:31.8913441Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:31.8913739Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:31.8914008Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:31.8914291Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:31.8914540Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:31.8914809Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:31.8915081Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:31.8915337Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:31.8915596Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:31.8915908Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:31.8916260Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:31.8916524Z #define __SSE2_MATH__ 1 2025-05-07T20:25:31.8916775Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:31.8917070Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8917380Z #define __amd64 1 2025-05-07T20:25:31.8917606Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:31.8917869Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:31.8918278Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:31.8918597Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:31.8918853Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8919129Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:31.8919387Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:31.8919651Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:31.8919908Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:31.8920173Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:31.8920443Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:31.8920716Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:31.8921040Z #define __x86_64 1 2025-05-07T20:25:31.8929812Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:31.8930235Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:31.8930700Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:31.8931198Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:31.8931839Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:31.8932228Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:31.8932487Z #define __LP64__ 1 2025-05-07T20:25:31.8932713Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8933062Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:31.8933439Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:31.8933717Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8933992Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8934274Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:31.8934555Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:31.8934817Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:31.8935079Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:31.8935340Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:31.8935595Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:31.8935925Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:31.8936291Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:31.8936562Z #define __FLT_DIG__ 6 2025-05-07T20:25:31.8936798Z #define __NO_INLINE__ 1 2025-05-07T20:25:31.8937043Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:31.8937359Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:31.8937708Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:31.8937967Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:31.8938229Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:31.8938477Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:31.8938733Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:31.8938988Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:31.8939273Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:31.8939559Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:31.8939824Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:31.8940118Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:31.8940447Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:31.8940720Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:31.8941008Z #define __FLT128_DIG__ 33 2025-05-07T20:25:31.8941258Z #define __INT32_C(c) c 2025-05-07T20:25:31.8941498Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:31.8941766Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:31.8942042Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:31.8942319Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:31.8942629Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:31.8942925Z #define unix 1 2025-05-07T20:25:31.8943156Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:31.8943469Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8943766Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:31.8944072Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:31.8944398Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:31.8944641Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:31.8944902Z #define __ELF__ 1 2025-05-07T20:25:31.8945277Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:31.8945561Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:31.8945837Z #define __FLT_RADIX__ 2 2025-05-07T20:25:31.8946087Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:31.8946439Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:31.8946802Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:31.8947061Z #define __SSE_MATH__ 1 2025-05-07T20:25:31.8947291Z #define __k8 1 2025-05-07T20:25:31.8947579Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:31.8947950Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:31.8948246Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:31.8948539Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:31.8948799Z #define __LDBL_DIG__ 18 2025-05-07T20:25:31.8949047Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:31.8949349Z #define __x86_64__ 1 2025-05-07T20:25:31.8949589Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:31.8949894Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:31.8950331Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8950637Z #define __FLT64_DIG__ 15 2025-05-07T20:25:31.8950918Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8951262Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:31.8951573Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8951888Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:31.8952163Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8952449Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:31.8952808Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:31.8953201Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:31.8953485Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:31.8953819Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:31.8954139Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:31.8954426Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:31.8954716Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:31.8955023Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:31.8955300Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:31.8955531Z #define __SEG_FS 1 2025-05-07T20:25:31.8955762Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:31.8956037Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:31.8956302Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8956589Z #define __SEG_GS 1 2025-05-07T20:25:31.8956900Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:31.8957272Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:31.8957543Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:31.8957827Z #define __INT16_TYPE__ short int 2025-05-07T20:25:31.8958098Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:31.8958390Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:31.8958653Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:31.8958893Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:31.8959168Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:31.8959509Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:31.8959899Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8960184Z #define linux 1 2025-05-07T20:25:31.8960414Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8960699Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:31.8960972Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:31.8961230Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:31.8961491Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:31.8961749Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:31.8962094Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:31.8962504Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:31.8962826Z #define __code_model_small__ 1 2025-05-07T20:25:31.8963100Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:31.8963387Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:31.8963732Z #define __k8__ 1 2025-05-07T20:25:31.8963960Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:31.8964252Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:31.8964550Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:31.8964787Z #define __pic__ 2 2025-05-07T20:25:31.8965039Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8965349Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:31.8965630Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8965958Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:31.8966325Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:31.8966676Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:31.8966947Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:31.8967239Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:31.8967543Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:31.8967800Z #define __linux__ 1 2025-05-07T20:25:31.8968028Z #define __INT64_TYPE__ long int 2025-05-07T20:25:31.8968382Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:31.8968637Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:31.8968909Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:31.8969163Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:31.8969445Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8969768Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:31.8970062Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:31.8970320Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:31.8970613Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:31.8970921Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:31.8971276Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:31.8971628Z #define __SSE__ 1 2025-05-07T20:25:31.8971863Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:31.8972201Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:31.8972537Z #define __amd64__ 1 2025-05-07T20:25:31.8972763Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:31.8973027Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:31.8973290Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:31.8973564Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:31.8973830Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:31.8974096Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:31.8974356Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:31.8974628Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:31.8974891Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:31.8975244Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:31.8975703Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:31.8976052Z #define _LP64 1 2025-05-07T20:25:31.8976279Z #define __UINT8_C(c) c 2025-05-07T20:25:31.8976521Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:31.8976785Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:31.8977048Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:31.8977333Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:31.8977640Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:31.8977996Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:31.8978446Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:31.8978817Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8979111Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:31.8979415Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:31.8979777Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:31.8980144Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:31.8980399Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:31.8980733Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:31.8981095Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:31.8981352Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:31.8981621Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:31.8981897Z #define __FXSR__ 1 2025-05-07T20:25:31.8982295Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:31.8982743Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:31.8983148Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:31.8983456Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:31.8983709Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:31.8984039Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:31.8984392Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:31.8984630Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:31.8984865Z #define __PIC__ 2 2025-05-07T20:25:31.8985112Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:31.8985503Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:31.8985877Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:31.8986206Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:31.8986613Z #define __SSE2__ 1 2025-05-07T20:25:31.8986827Z #define __INT32_TYPE__ int 2025-05-07T20:25:31.8987081Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:31.8987337Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:31.8987659Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:31.8988010Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:31.8988277Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:31.8988539Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:31.8988805Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8989082Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:31.8989371Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:31.8989615Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:31.8989898Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8990189Z #define __PIE__ 2 2025-05-07T20:25:31.8990502Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:31.8990886Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:31.8991237Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:31.8991589Z #define __INT16_C(c) c 2025-05-07T20:25:31.8991819Z #define __STDC__ 1 2025-05-07T20:25:31.8992049Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:31.8992313Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:31.8992565Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:31.8992863Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:31.8993200Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:31.8993531Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:31.8993796Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:31.8994076Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:31.8994335Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:31.8994617Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:31.8994906Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:31.8995175Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:31.8995473Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:31.8995866Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:31.8996228Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:31.8996533Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:31.8996826Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:31.8997070Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:31.8997232Z 2025-05-07T20:25:31.9504912Z 2025-05-07T20:25:31.9505309Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
2025-05-07T20:25:31.9505747Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:31.9505974Z 2025-05-07T20:25:33.8442486Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8442858Z #define __cpp_attributes 200809L 2025-05-07T20:25:33.8443192Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:33.8443544Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:33.8443831Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:33.8444087Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:33.8444773Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:33.8445128Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:33.8445406Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:33.8445726Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:33.8446042Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:33.8446315Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:33.8446567Z #define __CHAR_BIT__ 8 2025-05-07T20:25:33.8446806Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:33.8447053Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:33.8447301Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:33.8447570Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:33.8447846Z #define __cpp_static_assert 201411L 2025-05-07T20:25:33.8448127Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:33.8448426Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8448726Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:33.8449010Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:33.8449494Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:33.8449820Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:33.8450216Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:33.8450618Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:33.8450927Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:33.8451210Z #define __GCC_IEC_559 2 2025-05-07T20:25:33.8451452Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:33.8451775Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:33.8452047Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:33.8452328Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:33.8452619Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:33.8452934Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:33.8453236Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:33.8453567Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8453899Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:33.8454164Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:33.8454438Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:33.8454717Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:33.8455014Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:33.8455271Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:33.8455534Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:33.8455811Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:33.8456133Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:33.8456460Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:33.8456716Z #define __INT8_C(c) c 2025-05-07T20:25:33.8456948Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:33.8457222Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:33.8457541Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8457856Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:33.8458138Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:33.8458441Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:33.8458757Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:33.8459104Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:33.8459387Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:33.8459667Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:33.8459926Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8460200Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:33.8460475Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:33.8460858Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:33.8461275Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:33.8461573Z #define __linux 1 2025-05-07T20:25:33.8461836Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:33.8462113Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:33.8462390Z #define __unix 1 2025-05-07T20:25:33.8462610Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:33.8462998Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:33.8463289Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:33.8463560Z #define __WINT_MIN__ 0U 2025-05-07T20:25:33.8463801Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:33.8464080Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:33.8464353Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:33.8464613Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:33.8464865Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:33.8465149Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:33.8465440Z #define __INT64_C(c) c ## L 2025-05-07T20:25:33.8465704Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:33.8465999Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:33.8466263Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:33.8466560Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:33.8466833Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:33.8467088Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:33.8467438Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:33.8467901Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:33.8468152Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:33.8468420Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:33.8468692Z #define __DBL_DIG__ 15 2025-05-07T20:25:33.8468924Z #define __FLT32_DIG__ 6 2025-05-07T20:25:33.8469217Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:33.8469654Z #define __GXX_WEAK__ 1 2025-05-07T20:25:33.8469891Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:33.8470135Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:33.8470458Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:33.8470808Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:33.8471073Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:33.8471365Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:33.8471694Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:33.8472150Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:33.8472545Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:33.8472821Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:33.8473078Z #define __unix__ 1 2025-05-07T20:25:33.8473297Z #define __INT_WIDTH__ 32 2025-05-07T20:25:33.8473543Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:33.8473788Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:33.8474035Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:33.8474301Z #define __UINT16_C(c) c 2025-05-07T20:25:33.8474541Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:33.8474791Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:33.8475144Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:33.8475503Z #define __gnu_linux__ 1 2025-05-07T20:25:33.8475818Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:33.8476077Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8476353Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:33.8476638Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8476911Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:33.8477171Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:33.8477427Z #define __GNUC__ 11 2025-05-07T20:25:33.8477639Z #define __GXX_RTTI 1 2025-05-07T20:25:33.8477861Z #define __pie__ 2 2025-05-07T20:25:33.8478073Z #define __MMX__ 1 2025-05-07T20:25:33.8478307Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:33.8478573Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:33.8478855Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:33.8479119Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:33.8479369Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:33.8479664Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:33.8479972Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:33.8480316Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:33.8480685Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:33.8480982Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8481403Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:33.8481672Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:33.8481927Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:33.8482233Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:33.8482525Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:33.8482791Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:33.8483042Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:33.8483328Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:33.8492170Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:33.8492469Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:33.8492762Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:33.8493015Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:33.8493284Z #define __cplusplus 201703L 2025-05-07T20:25:33.8493554Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:33.8493839Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:33.8494098Z #define __DEPRECATED 1 2025-05-07T20:25:33.8494360Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:33.8494820Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:33.8495087Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:33.8495405Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:33.8495759Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:33.8496039Z #define __SSE2_MATH__ 1 2025-05-07T20:25:33.8496293Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:33.8496600Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8496887Z #define __amd64 1 2025-05-07T20:25:33.8497121Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:33.8497392Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:33.8497655Z #define __GNUG__ 11 2025-05-07T20:25:33.8497917Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:33.8498232Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:33.8498481Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:33.8498745Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:33.8499022Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:33.8499282Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:33.8499561Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:33.8499857Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:33.8500117Z #define __cpp_hex_float 201603L 2025-05-07T20:25:33.8500386Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:33.8500655Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:33.8500932Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:33.8501197Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:33.8501466Z #define __x86_64 1 2025-05-07T20:25:33.8501695Z #define __cpp_lambdas 200907L 2025-05-07T20:25:33.8501984Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:33.8502378Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:33.8502765Z #define __cpp_template_auto 201606L 2025-05-07T20:25:33.8503113Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:33.8503559Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:33.8504037Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:33.8504423Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:33.8504668Z #define __LP64__ 1 2025-05-07T20:25:33.8504896Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8505245Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:33.8505614Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:33.8505888Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:33.8506173Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:33.8506440Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:33.8506710Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:33.8506970Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:33.8507225Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:33.8507552Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:33.8507912Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:33.8508180Z #define __FLT_DIG__ 6 2025-05-07T20:25:33.8508583Z #define __NO_INLINE__ 1 2025-05-07T20:25:33.8508831Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:33.8509158Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:33.8509588Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:33.8509847Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:33.8510111Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:33.8510360Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:33.8510634Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:33.8510936Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:33.8511188Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:33.8511489Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:33.8511821Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:33.8512088Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:33.8512389Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:33.8512728Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:33.8513007Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:33.8513412Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:33.8513671Z #define __FLT128_DIG__ 33 2025-05-07T20:25:33.8513911Z #define __INT32_C(c) c 2025-05-07T20:25:33.8514146Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:33.8514425Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:33.8514702Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:33.8514973Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:33.8515289Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:33.8515595Z #define unix 1 2025-05-07T20:25:33.8515813Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:33.8516080Z #define __cpp_rtti 199711L 2025-05-07T20:25:33.8516344Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:33.8516650Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8516952Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:33.8517262Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:33.8517582Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:33.8517841Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:33.8518135Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:33.8518415Z #define __ELF__ 1 2025-05-07T20:25:33.8518639Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:33.8518920Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:33.8519195Z #define __FLT_RADIX__ 2 2025-05-07T20:25:33.8519435Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:33.8519791Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:33.8520155Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:33.8520424Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:33.8520701Z #define __k8 1 2025-05-07T20:25:33.8520997Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:33.8521363Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:33.8521707Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:33.8522011Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:33.8522271Z #define __LDBL_DIG__ 18 2025-05-07T20:25:33.8522519Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:33.8522780Z #define __x86_64__ 1 2025-05-07T20:25:33.8523020Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:33.8523311Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:33.8523645Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8523954Z #define __FLT64_DIG__ 15 2025-05-07T20:25:33.8524229Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:33.8524576Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:33.8524894Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:33.8525153Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:33.8525432Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8525736Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:33.8526092Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:33.8526488Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:33.8526779Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:33.8527207Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:33.8527515Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:33.8527834Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:33.8528582Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:33.8528924Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:33.8529230Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:33.8529508Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:33.8529740Z #define __SEG_FS 1 2025-05-07T20:25:33.8529972Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:33.8530249Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:33.8530516Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:33.8530804Z #define __SEG_GS 1 2025-05-07T20:25:33.8531114Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:33.8531494Z [... tail of the `c++ -dM -E` predefined-macro dump elided: roughly 160 further #define lines covering type sizes and widths (__SIZEOF_INT__ 4, __INT_MAX__ 0x7fffffff, __LONG_WIDTH__ 64), floating-point limits (__DBL_MANT_DIG__ 53, __LDBL_MAX_EXP__ 16384, __FLT128_MAX__), platform and ABI macros (__linux__ 1, __amd64__ 1, __SSE2__ 1, _LP64 1, _GNU_SOURCE 1, __pic__ 2), and C++17 feature-test macros (__cpp_structured_bindings 201606L, __cpp_deduction_guides 201703L, __cpp_constexpr 201603L) ...]
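Everything in the dump above comes from feeding an empty translation unit to the compiler with -dM -E, which makes the preprocessor print every macro it predefines for the selected dialect. A minimal sketch of the same technique for spot-checking a single macro (the build_binary env name is taken from this log; the grep targets are illustrative, not commands the workflow runs):

  # Dump all predefined macros for the default C++ dialect, sorted for easier scanning
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | sort

  # Spot-check a single feature-test macro, e.g. structured bindings (C++17)
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cpp_structured_bindings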
2025-05-07T20:25:33.9074247Z + conda run -n build_binary c++ --version
2025-05-07T20:25:35.7941057Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:35.7941457Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:35.7941904Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:35.7942439Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:35.8566076Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:35.8566649Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:37.8133402Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:37.8136392Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:37.8136967Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:39.7761298Z #define __cplusplus 201703L
2025-05-07T20:25:39.7764393Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:39.7799075Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:39.7799492Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:39.7811958Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:39.7812303Z env:
2025-05-07T20:25:39.7812531Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:39.7812834Z   BUILD_ENV: build_binary
2025-05-07T20:25:39.7813069Z   BUILD_TARGET: genai
2025-05-07T20:25:39.7813306Z   BUILD_VARIANT: cuda
2025-05-07T20:25:39.7813540Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:39.7813789Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:39.7814265Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:39.7814600Z ##[endgroup]
2025-05-07T20:25:40.1193475Z ################################################################################
2025-05-07T20:25:40.1193838Z # Install CUDA
2025-05-07T20:25:40.1194068Z #
2025-05-07T20:25:40.1208938Z # [2025-05-07T20:25:40.120Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:40.1209312Z ################################################################################
2025-05-07T20:25:40.1224336Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:40.2096258Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:40.2096624Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:40.2101851Z + conda clean --packages --tarball -y
2025-05-07T20:25:40.9219369Z Will remove 32 (148.9 MB) tarball(s).
2025-05-07T20:25:40.9219789Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:40.9850422Z + conda clean --all -y
2025-05-07T20:25:41.6629937Z There are no unused tarball(s) to remove.
2025-05-07T20:25:41.6630620Z Will remove 1 index cache(s).
2025-05-07T20:25:41.6631176Z There are no unused package(s) to remove.
2025-05-07T20:25:41.6631789Z There are no tempfile(s) to remove.
2025-05-07T20:25:41.6632397Z There are no logfile(s) to remove.
2025-05-07T20:25:41.7277878Z [INSTALL] Installing CUDA 12.8.0 ...
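The [EXEC] [ATTEMPT 0/3] prefix indicates that setup_env.bash wraps network-dependent commands in a bounded retry loop; the helper's actual implementation is not shown in this log. A minimal sketch of that pattern, with the function name, attempt count, and delay all assumed for illustration:

  # Hypothetical retry wrapper: run a command up to 4 times (attempts 0..3)
  exec_with_retries () {
    local max_retries=3
    local attempt
    for attempt in $(seq 0 "${max_retries}"); do
      echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
      "$@" && return 0   # stop on the first success
      sleep 5            # brief pause before retrying
    done
    return 1             # all attempts exhausted
  }

  exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null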
2025-05-07T20:25:41.7302450Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:42.6378954Z Channels:
2025-05-07T20:25:42.6379251Z  - conda-forge
2025-05-07T20:25:42.6379494Z Platform: linux-64
2025-05-07T20:25:53.2252628Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:54.3523869Z Solving environment: done
2025-05-07T20:25:54.4275993Z ## Package Plan ##
2025-05-07T20:25:54.4276430Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:54.4276911Z   added / updated specs:
2025-05-07T20:25:54.4277173Z     - cuda=12.8.0
2025-05-07T20:25:54.4277471Z The following packages will be downloaded:
2025-05-07T20:25:54.4277831Z     package | build
2025-05-07T20:25:54.4278148Z     ---------------------------|-----------------
2025-05-07T20:25:54.4278527Z     alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge
2025-05-07T20:25:54.4278940Z     attr-2.5.1 | h166bdaf_1 69 KB conda-forge
2025-05-07T20:25:54.4279398Z     binutils-2.40 | h4852527_7 31 KB conda-forge
2025-05-07T20:25:54.4279976Z     c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge
2025-05-07T20:25:54.4280409Z     cuda-12.8.0 | ha804496_0 26 KB conda-forge
2025-05-07T20:25:54.4280837Z     cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge
2025-05-07T20:25:54.4281806Z     cuda-command-line-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4282320Z     cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge
2025-05-07T20:25:54.4282842Z     cuda-crt-dev_linux-64-12.8.61 | ha770c72_1 90 KB conda-forge
2025-05-07T20:25:54.4283512Z     cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge
2025-05-07T20:25:54.4284127Z     cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4284599Z     cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge
2025-05-07T20:25:54.4285301Z     cuda-cudart-dev_linux-64-12.8.57 | h3f2d84a_1 377 KB conda-forge
2025-05-07T20:25:54.4286170Z     cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4286683Z     cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1 950 KB conda-forge
2025-05-07T20:25:54.4287192Z     cuda-cudart_linux-64-12.8.57 | h3f2d84a_1 188 KB conda-forge
2025-05-07T20:25:54.4287672Z     cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge
2025-05-07T20:25:54.4288111Z     cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge
2025-05-07T20:25:54.4288550Z     cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge
2025-05-07T20:25:54.4289057Z     cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge
2025-05-07T20:25:54.4289506Z     cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge
2025-05-07T20:25:54.4289990Z     cuda-driver-dev_linux-64-12.8.90 | h3f2d84a_1 36 KB conda-forge
2025-05-07T20:25:54.4290456Z     cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge
2025-05-07T20:25:54.4290890Z     cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4291356Z     cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4291821Z     cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge
2025-05-07T20:25:54.4292253Z     cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge
2025-05-07T20:25:54.4292703Z     cuda-nvcc-dev_linux-64-12.8.61 | he91c749_1 12.7 MB conda-forge
2025-05-07T20:25:54.4293173Z     cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge
2025-05-07T20:25:54.4293627Z     cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 MB conda-forge
2025-05-07T20:25:54.4294086Z     cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge
2025-05-07T20:25:54.4294539Z     cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge
2025-05-07T20:25:54.4294997Z     cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge
2025-05-07T20:25:54.4295437Z     cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge
2025-05-07T20:25:54.4295881Z     cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge
2025-05-07T20:25:54.4296323Z     cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge
2025-05-07T20:25:54.4296766Z     cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge
2025-05-07T20:25:54.4297201Z     cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge
2025-05-07T20:25:54.4297650Z     cuda-nvvm-dev_linux-64-12.8.61 | ha770c72_1 25 KB conda-forge
2025-05-07T20:25:54.4298121Z     cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge
2025-05-07T20:25:54.4298576Z     cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge
2025-05-07T20:25:54.4299020Z     cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge
2025-05-07T20:25:54.4299492Z     cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge
2025-05-07T20:25:54.4299940Z     cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge
2025-05-07T20:25:54.4300529Z     cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge
2025-05-07T20:25:54.4300990Z     cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:54.4301455Z     cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge
2025-05-07T20:25:54.4301921Z     cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge
2025-05-07T20:25:54.4302352Z     cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge
2025-05-07T20:25:54.4302781Z     cuda-version-12.8 | h5d125a7_3 21 KB conda-forge
2025-05-07T20:25:54.4303238Z     cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge
2025-05-07T20:25:54.4303781Z     cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge
2025-05-07T20:25:54.4304184Z     dbus-1.13.6 | h5008d03_3 604 KB conda-forge
2025-05-07T20:25:54.4304566Z     expat-2.7.0 | h5888daf_0 137 KB conda-forge
2025-05-07T20:25:54.4305030Z     font-ttf-dejavu-sans-mono-2.37 | hab24e00_0 388 KB conda-forge
2025-05-07T20:25:54.4305543Z     font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge
2025-05-07T20:25:54.4306044Z     font-ttf-source-code-pro-2.038 | h77eed37_0 684 KB conda-forge
2025-05-07T20:25:54.4306530Z     font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge
2025-05-07T20:25:54.4306968Z     fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge
2025-05-07T20:25:54.4307423Z     fonts-conda-ecosystem-1 | 0 4 KB conda-forge
2025-05-07T20:25:54.4307883Z     fonts-conda-forge-1 | 0 4 KB conda-forge
2025-05-07T20:25:54.4308322Z     freetype-2.13.3 | ha770c72_1 168 KB conda-forge
2025-05-07T20:25:54.4308716Z     gcc-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:54.4309200Z     gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge
2025-05-07T20:25:54.4309603Z     gmp-6.3.0 | hac33072_2 449 KB conda-forge
2025-05-07T20:25:54.4309978Z     gxx-11.4.0 | h602e360_13 49 KB conda-forge
2025-05-07T20:25:54.4310370Z     keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge
2025-05-07T20:25:54.4310760Z     krb5-1.21.3 | h659f571_0 1.3 MB conda-forge
2025-05-07T20:25:54.4311145Z     libcap-2.71 | h39aace5_0 100 KB conda-forge
2025-05-07T20:25:54.4311559Z     libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge
2025-05-07T20:25:54.4312002Z     libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge
2025-05-07T20:25:54.4312448Z     libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge
2025-05-07T20:25:54.4312883Z     libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge
2025-05-07T20:25:54.4313328Z     libcufile-1.13.0.11 | h12f29b5_0 939 KB conda-forge
2025-05-07T20:25:54.4313765Z     libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge
2025-05-07T20:25:54.4314209Z     libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge
2025-05-07T20:25:54.4314657Z     libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge
2025-05-07T20:25:54.4315100Z     libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge
2025-05-07T20:25:54.4315563Z     libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge
2025-05-07T20:25:54.4316024Z     libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge
2025-05-07T20:25:54.4316493Z     libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge
2025-05-07T20:25:54.4316959Z     libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge
2025-05-07T20:25:54.4317395Z     libexpat-2.7.0 | h5888daf_0 73 KB conda-forge
2025-05-07T20:25:54.4317918Z     libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge
2025-05-07T20:25:54.4318366Z     libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge
2025-05-07T20:25:54.4318840Z     libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge
2025-05-07T20:25:54.4319298Z     libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge
2025-05-07T20:25:54.4319708Z     libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge
2025-05-07T20:25:54.4320133Z     libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge
2025-05-07T20:25:54.4320636Z     libiconv-1.18 | h4ce23a2_1 696 KB conda-forge
2025-05-07T20:25:54.4321035Z     libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge
2025-05-07T20:25:54.4321439Z     libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge
2025-05-07T20:25:54.4321862Z     libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge
2025-05-07T20:25:54.4322281Z     libnsl-2.0.1 | hd590300_0 33 KB conda-forge
2025-05-07T20:25:54.4322725Z     libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge
2025-05-07T20:25:54.4323149Z     libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge
2025-05-07T20:25:54.4323600Z     libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge
2025-05-07T20:25:54.4324073Z     libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge
2025-05-07T20:25:54.4324537Z     libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge
2025-05-07T20:25:54.4324999Z     libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge
2025-05-07T20:25:54.4325436Z     libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge
2025-05-07T20:25:54.4325877Z     libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge
2025-05-07T20:25:54.4326291Z     libpng-1.6.47 | h943b412_0 282 KB conda-forge
2025-05-07T20:25:54.4326698Z     libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge
2025-05-07T20:25:54.4327131Z     libsystemd0-256.9 | h2774228_0 401 KB conda-forge
2025-05-07T20:25:54.4327559Z     libudev1-257.4 | h9a4d06a_0 140 KB conda-forge
2025-05-07T20:25:54.4327972Z     libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge
2025-05-07T20:25:54.4328765Z     libxcb-1.17.0 | h8a09558_0 387 KB conda-forge
2025-05-07T20:25:54.4329249Z     libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge
2025-05-07T20:25:54.4329691Z     libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge
2025-05-07T20:25:54.4330104Z     libxml2-2.13.5 | h064dc61_0 673 KB conda-forge
2025-05-07T20:25:54.4330511Z     libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge
2025-05-07T20:25:54.4330906Z     lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge
2025-05-07T20:25:54.4331345Z     nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge
2025-05-07T20:25:54.4331781Z     nspr-4.36 | h5888daf_0 225 KB conda-forge
2025-05-07T20:25:54.4332163Z     nss-3.111 | h159eef7_0 1.9 MB conda-forge
2025-05-07T20:25:54.4332549Z     ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge
2025-05-07T20:25:54.4332987Z     opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge
2025-05-07T20:25:54.4333419Z     pcre2-10.44 | hc749103_2 934 KB conda-forge
2025-05-07T20:25:54.4333841Z     pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge
2025-05-07T20:25:54.4334280Z     python-3.11.8 | hab00c5b_0_cpython 29.3 MB conda-forge
2025-05-07T20:25:54.4334847Z     rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge
2025-05-07T20:25:54.4335256Z     sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge
2025-05-07T20:25:54.4335651Z     tk-8.6.13 | noxft_h4845f30_101 3.2 MB conda-forge
2025-05-07T20:25:54.4336046Z     wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge
2025-05-07T20:25:54.4336439Z     xcb-util-0.4.1 | hb711507_2 19 KB conda-forge
2025-05-07T20:25:54.4336865Z     xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge
2025-05-07T20:25:54.4337313Z     xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge
2025-05-07T20:25:54.4337901Z     xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge
2025-05-07T20:25:54.4338374Z     xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge
2025-05-07T20:25:54.4338827Z     xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge
2025-05-07T20:25:54.4339280Z     xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge
2025-05-07T20:25:54.4339729Z     xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge
2025-05-07T20:25:54.4340152Z     xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge
2025-05-07T20:25:54.4340576Z     xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge
2025-05-07T20:25:54.4340995Z     xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge
2025-05-07T20:25:54.4341456Z     xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge
2025-05-07T20:25:54.4341939Z     xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge
2025-05-07T20:25:54.4342394Z     xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:54.4342830Z     xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge
2025-05-07T20:25:54.4343282Z     xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge
2025-05-07T20:25:54.4343722Z     xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge
2025-05-07T20:25:54.4344162Z     xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge
2025-05-07T20:25:54.4344616Z     xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge
2025-05-07T20:25:54.4345070Z     xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge
2025-05-07T20:25:54.4345482Z     zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge
2025-05-07T20:25:54.4345859Z     zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge
2025-05-07T20:25:54.4346249Z     ------------------------------------------------------------
2025-05-07T20:25:54.4346593Z     Total: 1.90 GB
2025-05-07T20:25:54.4346940Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:54.4347371Z   alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0
2025-05-07T20:25:54.4347801Z   attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1
2025-05-07T20:25:54.4348216Z   binutils conda-forge/linux-64::binutils-2.40-h4852527_7
2025-05-07T20:25:54.4348675Z   c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0
2025-05-07T20:25:54.4349190Z   cuda conda-forge/noarch::cuda-12.8.0-ha804496_0
2025-05-07T20:25:54.4349693Z   cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1
2025-05-07T20:25:54.4350287Z   cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4350857Z   cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0
2025-05-07T20:25:54.4351394Z   cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:25:54.4351953Z   cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1
2025-05-07T20:25:54.4352560Z   cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1
2025-05-07T20:25:54.4353072Z   cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1
2025-05-07T20:25:54.4353639Z   cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4356056Z   cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1
2025-05-07T20:25:54.4356676Z   cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4357267Z   cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1
2025-05-07T20:25:54.4357936Z   cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4358446Z   cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4358946Z   cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0
2025-05-07T20:25:54.4359471Z   cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4360055Z   cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1
2025-05-07T20:25:54.4360625Z   cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1
2025-05-07T20:25:54.4361145Z   cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0
2025-05-07T20:25:54.4361624Z   cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0
2025-05-07T20:25:54.4362181Z   cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0
2025-05-07T20:25:54.4362724Z   cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0
2025-05-07T20:25:54.4363199Z   cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0
2025-05-07T20:25:54.4363706Z   cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1
2025-05-07T20:25:54.4364264Z   cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1
2025-05-07T20:25:54.4364799Z   cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1
2025-05-07T20:25:54.4365348Z   cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0
2025-05-07T20:25:54.4365875Z   cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4366391Z   cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4366890Z   cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4367388Z   cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4367879Z   cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0
2025-05-07T20:25:54.4368381Z   cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0
2025-05-07T20:25:54.4369057Z   cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4369603Z   cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1
2025-05-07T20:25:54.4370154Z   cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1
2025-05-07T20:25:54.4370691Z   cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1
2025-05-07T20:25:54.4371192Z   cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0
2025-05-07T20:25:54.4371658Z   cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4372177Z   cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0
2025-05-07T20:25:54.4372740Z   cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0
2025-05-07T20:25:54.4373281Z   cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0
2025-05-07T20:25:54.4373816Z   cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4374359Z   cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0
2025-05-07T20:25:54.4374943Z   cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4375415Z   cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3
2025-05-07T20:25:54.4375931Z   cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0
2025-05-07T20:25:54.4376469Z   cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:25:54.4376918Z   dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:25:54.4377321Z   expat conda-forge/linux-64::expat-2.7.0-h5888daf_0
2025-05-07T20:25:54.4377829Z   font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:25:54.4378515Z   font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:25:54.4379160Z   font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:25:54.4379729Z   font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:25:54.4380226Z   fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:25:54.4380718Z   fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:25:54.4381202Z   fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:25:54.4381652Z   freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:25:54.4382069Z   gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:25:54.4382495Z   gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0
2025-05-07T20:25:54.4382918Z   gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:25:54.4383295Z   gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:25:54.4383703Z   keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:25:54.4384117Z   krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:25:54.4384521Z   libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:25:54.4384960Z   libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0
2025-05-07T20:25:54.4385467Z   libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0
2025-05-07T20:25:54.4385968Z   libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0
2025-05-07T20:25:54.4386456Z   libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0
2025-05-07T20:25:54.4386947Z   libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0
2025-05-07T20:25:54.4387446Z   libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0
2025-05-07T20:25:54.4387954Z   libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0
2025-05-07T20:25:54.4388446Z   libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0
2025-05-07T20:25:54.4388984Z   libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0
2025-05-07T20:25:54.4389658Z   libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0
2025-05-07T20:25:54.4390198Z   libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0
2025-05-07T20:25:54.4390722Z   libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0
2025-05-07T20:25:54.4391241Z   libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:25:54.4391697Z   libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:54.4392165Z   libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:25:54.4392660Z   libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:25:54.4393167Z   libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:25:54.4393644Z   libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:25:54.4394082Z   libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2
2025-05-07T20:25:54.4394642Z   libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:25:54.4395110Z   libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:25:54.4395538Z   libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:25:54.4395962Z   libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0
2025-05-07T20:25:54.4396421Z   libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0
2025-05-07T20:25:54.4396870Z   libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:54.4397296Z   libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:25:54.4397836Z   libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0
2025-05-07T20:25:54.4398360Z   libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0
2025-05-07T20:25:54.4398892Z   libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0
2025-05-07T20:25:54.4399437Z   libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0
2025-05-07T20:25:54.4399963Z   libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0
2025-05-07T20:25:54.4400474Z   libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0
2025-05-07T20:25:54.4400969Z   libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2
2025-05-07T20:25:54.4401409Z   libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:25:54.4401840Z   libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0
2025-05-07T20:25:54.4402311Z   libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:25:54.4402777Z   libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:25:54.4403202Z   libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:25:54.4403654Z   libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:25:54.4404142Z   libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:25:54.4404585Z   libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:25:54.4405008Z   libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:54.4405414Z   lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:25:54.4405899Z   nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0
2025-05-07T20:25:54.4406384Z   nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:25:54.4406752Z   nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:25:54.4407148Z   ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:25:54.4407640Z   opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:25:54.4408125Z   pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:25:54.4408585Z   pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:25:54.4409117Z   rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:25:54.4409550Z   wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:25:54.4409977Z   xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:25:54.4410450Z   xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:25:54.4410971Z   xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:25:54.4411500Z   xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:25:54.4412073Z   xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:25:54.4412593Z   xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:25:54.4413102Z   xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:25:54.4413744Z   xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:25:54.4414213Z   xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:25:54.4414675Z   xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:25:54.4415150Z   xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:25:54.4415689Z   xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:25:54.4416262Z   xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:25:54.4416793Z   xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:25:54.4417376Z   xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:25:54.4417886Z   xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:25:54.4418372Z   xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:25:54.4418868Z   xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:25:54.4419453Z   xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:25:54.4419978Z   xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:25:54.4420415Z   zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:25:54.4420780Z The following packages will be UPDATED:
2025-05-07T20:25:54.4421261Z   libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:54.4421861Z   zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2
2025-05-07T20:25:54.4422395Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:54.4423108Z   python pkgs/main::python-3.11.11-he870216_0 --> conda-forge::python-3.11.8-hab00c5b_0_cpython
2025-05-07T20:25:54.4423730Z   sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:25:54.4434049Z   tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
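The UPDATED and SUPERSEDED sections are a direct consequence of `-c conda-forge --override-channels`: packages previously installed from pkgs/main (python, sqlite, tk) are swapped for conda-forge builds, even where that means an older version (python 3.11.11 -> 3.11.8). A sketch of how an environment can be committed to conda-forge up front so later installs never mix channels (these are standard conda configuration commands; applying them here is an assumption, not something this log shows):

  # Prefer conda-forge for all future solves and enforce strict channel priority
  conda config --add channels conda-forge
  conda config --set channel_priority strict

With strict priority the solver never falls back to a lower-priority channel for a package that exists in conda-forge, so channel-mixing SUPERSEDED moves like the ones above do not occur.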
2025-05-07T20:25:54.4434544Z Downloading and Extracting Packages: ...working...
[... several hundred interleaved progress-bar updates elided: libcublas (460.2 MB), nsight-compute (320.6 MB), libcusparse (164.9 MB), libcusolver (156.9 MB), libcufft (147.4 MB), libnpp (130.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (112.4 MB), and the remaining packages download in parallel; the log breaks off with the largest archives at roughly 20% to 66% complete ...]
| 320.6 MB | ###3 | 34%  2025-05-07T20:25:57.5731766Z libcublas-12.8.3.14 | 460.2 MB | ## | 21% 2025-05-07T20:25:57.5732024Z 2025-05-07T20:25:57.5732030Z 2025-05-07T20:25:57.5732035Z 2025-05-07T20:25:57.6107388Z libcusolver-11.7.2.5 | 156.9 MB | ######8 | 68%  2025-05-07T20:25:57.6107665Z 2025-05-07T20:25:57.6108361Z 2025-05-07T20:25:57.6195384Z libcusparse-12.5.7.5 | 164.9 MB | ######5 | 65%  2025-05-07T20:25:57.6195653Z 2025-05-07T20:25:57.6195658Z 2025-05-07T20:25:57.6195661Z 2025-05-07T20:25:57.6195665Z 2025-05-07T20:25:57.6281218Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 66%  2025-05-07T20:25:57.6285738Z 2025-05-07T20:25:57.6375507Z nsight-compute-2025. | 320.6 MB | ###4 | 35%  2025-05-07T20:25:57.6737149Z libcublas-12.8.3.14 | 460.2 MB | ##1 | 21% 2025-05-07T20:25:57.6737417Z 2025-05-07T20:25:57.6737421Z 2025-05-07T20:25:57.6738362Z 2025-05-07T20:25:57.7113480Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 71%  2025-05-07T20:25:57.7113876Z 2025-05-07T20:25:57.7115520Z 2025-05-07T20:25:57.7281477Z libcusparse-12.5.7.5 | 164.9 MB | ######7 | 67%  2025-05-07T20:25:57.7282599Z 2025-05-07T20:25:57.7347027Z nsight-compute-2025. | 320.6 MB | ###5 | 36%  2025-05-07T20:25:57.7347304Z 2025-05-07T20:25:57.7347308Z 2025-05-07T20:25:57.7347567Z 2025-05-07T20:25:57.7350240Z 2025-05-07T20:25:57.7375865Z libcufft-11.3.3.41 | 147.4 MB | ######8 | 69%  2025-05-07T20:25:57.7767581Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 22% 2025-05-07T20:25:57.7767956Z 2025-05-07T20:25:57.7767974Z 2025-05-07T20:25:57.7768916Z 2025-05-07T20:25:57.8164540Z libcusolver-11.7.2.5 | 156.9 MB | #######2 | 73%  2025-05-07T20:25:57.8164830Z 2025-05-07T20:25:57.8167320Z 2025-05-07T20:25:57.8284736Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 70%  2025-05-07T20:25:57.8286975Z 2025-05-07T20:25:57.8347482Z nsight-compute-2025. | 320.6 MB | ###6 | 37%  2025-05-07T20:25:57.8348037Z 2025-05-07T20:25:57.8348041Z 2025-05-07T20:25:57.8348045Z 2025-05-07T20:25:57.8348049Z 2025-05-07T20:25:57.8375872Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:25:57.8770646Z libcublas-12.8.3.14 | 460.2 MB | ##2 | 23% 2025-05-07T20:25:57.8770899Z 2025-05-07T20:25:57.8770951Z 2025-05-07T20:25:57.8771019Z 2025-05-07T20:25:57.9177332Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 75%  2025-05-07T20:25:57.9177746Z 2025-05-07T20:25:57.9182545Z 2025-05-07T20:25:57.9286498Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:25:57.9287530Z 2025-05-07T20:25:57.9378563Z nsight-compute-2025. | 320.6 MB | ###8 | 38%  2025-05-07T20:25:57.9662158Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:25:57.9662423Z 2025-05-07T20:25:57.9662429Z 2025-05-07T20:25:57.9662435Z 2025-05-07T20:25:57.9662440Z 2025-05-07T20:25:57.9771081Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:25:57.9771500Z 2025-05-07T20:25:57.9771504Z 2025-05-07T20:25:57.9771508Z 2025-05-07T20:25:58.0180056Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 78%  2025-05-07T20:25:58.0180353Z 2025-05-07T20:25:58.0182144Z 2025-05-07T20:25:58.0288508Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:25:58.0290703Z 2025-05-07T20:25:58.0385113Z nsight-compute-2025. 
| 320.6 MB | ###9 | 39%  2025-05-07T20:25:58.0771673Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 24% 2025-05-07T20:25:58.0772046Z 2025-05-07T20:25:58.0772052Z 2025-05-07T20:25:58.0772057Z 2025-05-07T20:25:58.1185589Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 81%  2025-05-07T20:25:58.1186014Z 2025-05-07T20:25:58.1187845Z 2025-05-07T20:25:58.1290788Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:25:58.1292411Z 2025-05-07T20:25:58.1298526Z nsight-compute-2025. | 320.6 MB | #### | 41%  2025-05-07T20:25:58.1298823Z 2025-05-07T20:25:58.1298827Z 2025-05-07T20:25:58.1298831Z 2025-05-07T20:25:58.1298835Z 2025-05-07T20:25:58.1385423Z libcufft-11.3.3.41 | 147.4 MB | #######5 | 76%  2025-05-07T20:25:58.1838155Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 25% 2025-05-07T20:25:58.1838477Z 2025-05-07T20:25:58.1838481Z 2025-05-07T20:25:58.1838507Z 2025-05-07T20:25:58.2270271Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 83%  2025-05-07T20:25:58.2270583Z 2025-05-07T20:25:58.2270595Z 2025-05-07T20:25:58.2300413Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:25:58.2300715Z 2025-05-07T20:25:58.2300719Z 2025-05-07T20:25:58.2300722Z 2025-05-07T20:25:58.2301899Z 2025-05-07T20:25:58.2306724Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 78%  2025-05-07T20:25:58.2308250Z 2025-05-07T20:25:58.2475470Z nsight-compute-2025. | 320.6 MB | ####1 | 42%  2025-05-07T20:25:58.2879053Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:25:58.2879364Z 2025-05-07T20:25:58.2879370Z 2025-05-07T20:25:58.2881991Z 2025-05-07T20:25:58.3301955Z libcusolver-11.7.2.5 | 156.9 MB | ########5 | 86%  2025-05-07T20:25:58.3302333Z 2025-05-07T20:25:58.3302339Z 2025-05-07T20:25:58.3302344Z 2025-05-07T20:25:58.3304169Z 2025-05-07T20:25:58.3370356Z libcufft-11.3.3.41 | 147.4 MB | ######## | 81%  2025-05-07T20:25:58.3370643Z 2025-05-07T20:25:58.3370648Z 2025-05-07T20:25:58.3451552Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:25:58.3455615Z 2025-05-07T20:25:58.3485913Z nsight-compute-2025. | 320.6 MB | ####2 | 43%  2025-05-07T20:25:58.4303743Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 27% 2025-05-07T20:25:58.4304015Z 2025-05-07T20:25:58.4304019Z 2025-05-07T20:25:58.4304023Z 2025-05-07T20:25:58.4307223Z 2025-05-07T20:25:58.4372393Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:25:58.4372765Z 2025-05-07T20:25:58.4373047Z 2025-05-07T20:25:58.4452967Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 83%  2025-05-07T20:25:58.4453263Z 2025-05-07T20:25:58.4491472Z nsight-compute-2025. | 320.6 MB | ####4 | 44%  2025-05-07T20:25:58.4879256Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 28% 2025-05-07T20:25:58.4879580Z 2025-05-07T20:25:58.4879584Z 2025-05-07T20:25:58.4879610Z 2025-05-07T20:25:58.5304486Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:25:58.5304851Z 2025-05-07T20:25:58.5304857Z 2025-05-07T20:25:58.5304863Z 2025-05-07T20:25:58.5307465Z 2025-05-07T20:25:58.5372718Z libcufft-11.3.3.41 | 147.4 MB | ########6 | 86%  2025-05-07T20:25:58.5373110Z 2025-05-07T20:25:58.5373711Z 2025-05-07T20:25:58.5494898Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:25:58.5510545Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:25:58.5513491Z 2025-05-07T20:25:58.5880081Z nsight-compute-2025. 
| 320.6 MB | ####5 | 45%  2025-05-07T20:25:58.5880422Z 2025-05-07T20:25:58.5880426Z 2025-05-07T20:25:58.5882441Z 2025-05-07T20:25:58.6335417Z libcusolver-11.7.2.5 | 156.9 MB | ######### | 90%  2025-05-07T20:25:58.6335698Z 2025-05-07T20:25:58.6335702Z 2025-05-07T20:25:58.6335706Z 2025-05-07T20:25:58.6335714Z 2025-05-07T20:25:58.6375258Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 89%  2025-05-07T20:25:58.6375826Z 2025-05-07T20:25:58.6379401Z 2025-05-07T20:25:58.6514962Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 88%  2025-05-07T20:25:58.6517268Z 2025-05-07T20:25:58.6551472Z nsight-compute-2025. | 320.6 MB | ####6 | 46%  2025-05-07T20:25:58.6880711Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:25:58.6880970Z 2025-05-07T20:25:58.6880975Z 2025-05-07T20:25:58.6882603Z 2025-05-07T20:25:58.7341942Z libcusolver-11.7.2.5 | 156.9 MB | #########2 | 93%  2025-05-07T20:25:58.7342235Z 2025-05-07T20:25:58.7342239Z 2025-05-07T20:25:58.7342269Z 2025-05-07T20:25:58.7343334Z 2025-05-07T20:25:58.7453592Z libcufft-11.3.3.41 | 147.4 MB | #########1 | 91%  2025-05-07T20:25:58.7453870Z 2025-05-07T20:25:58.7453874Z 2025-05-07T20:25:58.7547629Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:25:58.7550507Z 2025-05-07T20:25:58.7556904Z nsight-compute-2025. | 320.6 MB | ####7 | 48%  2025-05-07T20:25:58.7881401Z libcublas-12.8.3.14 | 460.2 MB | ### | 30% 2025-05-07T20:25:58.7881781Z 2025-05-07T20:25:58.7881788Z 2025-05-07T20:25:58.7883123Z 2025-05-07T20:25:58.8368673Z libcusolver-11.7.2.5 | 156.9 MB | #########5 | 95%  2025-05-07T20:25:58.8368969Z 2025-05-07T20:25:58.8368974Z 2025-05-07T20:25:58.8368978Z 2025-05-07T20:25:58.8368981Z 2025-05-07T20:25:58.8455842Z libcufft-11.3.3.41 | 147.4 MB | #########3 | 93%  2025-05-07T20:25:58.8456226Z 2025-05-07T20:25:58.8457832Z 2025-05-07T20:25:58.8595179Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 93%  2025-05-07T20:25:58.8635740Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:25:58.8636822Z 2025-05-07T20:25:58.8883933Z nsight-compute-2025. | 320.6 MB | ####8 | 49%  2025-05-07T20:25:58.8884206Z 2025-05-07T20:25:58.8884211Z 2025-05-07T20:25:58.8886266Z 2025-05-07T20:25:58.9371310Z libcusolver-11.7.2.5 | 156.9 MB | #########7 | 98%  2025-05-07T20:25:58.9371622Z 2025-05-07T20:25:58.9371627Z 2025-05-07T20:25:58.9371631Z 2025-05-07T20:25:58.9371646Z 2025-05-07T20:25:58.9490340Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 96%  2025-05-07T20:25:58.9490654Z 2025-05-07T20:25:58.9494051Z 2025-05-07T20:25:58.9662353Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 95%  2025-05-07T20:25:58.9665103Z 2025-05-07T20:25:58.9742804Z nsight-compute-2025. | 320.6 MB | ####9 | 50%  2025-05-07T20:25:59.0374152Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 32% 2025-05-07T20:25:59.0374474Z 2025-05-07T20:25:59.0374771Z 2025-05-07T20:25:59.0374775Z 2025-05-07T20:25:59.0374778Z 2025-05-07T20:25:59.0516062Z libcufft-11.3.3.41 | 147.4 MB | #########8 | 98%  2025-05-07T20:25:59.0516338Z 2025-05-07T20:25:59.0516713Z 2025-05-07T20:25:59.0665634Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:25:59.0665912Z 2025-05-07T20:25:59.0744710Z nsight-compute-2025. | 320.6 MB | #####1 | 51%  2025-05-07T20:25:59.1517601Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 33% 2025-05-07T20:25:59.1517959Z 2025-05-07T20:25:59.1520200Z 2025-05-07T20:25:59.1666168Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:25:59.1666548Z 2025-05-07T20:25:59.1747096Z nsight-compute-2025. 
| 320.6 MB | #####2 | 52%  2025-05-07T20:25:59.2667615Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:25:59.2667939Z 2025-05-07T20:25:59.2753896Z nsight-compute-2025. | 320.6 MB | #####4 | 54%  2025-05-07T20:25:59.3671457Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:25:59.3671733Z 2025-05-07T20:25:59.3760181Z nsight-compute-2025. | 320.6 MB | #####5 | 56%  2025-05-07T20:25:59.4764895Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:25:59.4790246Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:25:59.4791979Z 2025-05-07T20:25:59.5766316Z nsight-compute-2025. | 320.6 MB | #####7 | 57%  2025-05-07T20:25:59.5790766Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:25:59.5791692Z 2025-05-07T20:25:59.6767766Z nsight-compute-2025. | 320.6 MB | #####9 | 59%  2025-05-07T20:25:59.6791188Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 40% 2025-05-07T20:25:59.6792076Z 2025-05-07T20:25:59.7768360Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:25:59.7791707Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:25:59.7792035Z 2025-05-07T20:25:59.8942725Z nsight-compute-2025. | 320.6 MB | ######2 | 63%  2025-05-07T20:25:59.8944263Z 2025-05-07T20:25:59.9139040Z nsight-compute-2025. | 320.6 MB | ######4 | 64%  2025-05-07T20:26:00.0115252Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 42% 2025-05-07T20:26:00.0116854Z 2025-05-07T20:26:00.0277991Z nsight-compute-2025. | 320.6 MB | ######5 | 66%  2025-05-07T20:26:00.1274686Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 43% 2025-05-07T20:26:00.1274939Z 2025-05-07T20:26:00.1323671Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:26:00.2275701Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 44% 2025-05-07T20:26:00.2277814Z 2025-05-07T20:26:00.2329008Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:26:00.3282015Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:00.3282371Z 2025-05-07T20:26:00.3329874Z nsight-compute-2025. | 320.6 MB | ####### | 71%  2025-05-07T20:26:00.4332110Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:26:00.4837719Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:00.4838039Z 2025-05-07T20:26:00.5336324Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:26:00.5839557Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:00.5839812Z 2025-05-07T20:26:00.6419485Z nsight-compute-2025. | 320.6 MB | #######4 | 74%  2025-05-07T20:26:00.6841555Z libcublas-12.8.3.14 | 460.2 MB | ##### | 51% 2025-05-07T20:26:00.6841831Z 2025-05-07T20:26:00.7421924Z nsight-compute-2025. | 320.6 MB | #######5 | 76%  2025-05-07T20:26:00.7842020Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:26:00.7842370Z 2025-05-07T20:26:00.8845114Z nsight-compute-2025. | 320.6 MB | #######7 | 78%  2025-05-07T20:26:00.8845838Z 2025-05-07T20:26:00.9285999Z nsight-compute-2025. | 320.6 MB | ######## | 81%  2025-05-07T20:26:01.0036364Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:26:01.0037373Z 2025-05-07T20:26:01.0288411Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:01.1143044Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:26:01.1143381Z 2025-05-07T20:26:01.1289365Z nsight-compute-2025. | 320.6 MB | ########4 | 85%  2025-05-07T20:26:01.2207312Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:26:01.2209142Z 2025-05-07T20:26:01.2290014Z nsight-compute-2025. 
| 320.6 MB | ########6 | 86%  2025-05-07T20:26:01.3353553Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:26:01.3389017Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 58% 2025-05-07T20:26:01.3389415Z 2025-05-07T20:26:01.3927530Z nsight-compute-2025. | 320.6 MB | ########8 | 88%  2025-05-07T20:26:01.3927799Z 2025-05-07T20:26:01.3927804Z 2025-05-07T20:26:01.3927807Z 2025-05-07T20:26:01.3932787Z 2025-05-07T20:26:01.4388219Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:01.4466162Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:26:01.4466560Z 2025-05-07T20:26:01.4466809Z 2025-05-07T20:26:01.4466816Z 2025-05-07T20:26:01.4466822Z 2025-05-07T20:26:01.4466849Z 2025-05-07T20:26:01.4635396Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:01.4636435Z 2025-05-07T20:26:01.5468800Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:26:01.5469225Z 2025-05-07T20:26:01.5469231Z 2025-05-07T20:26:01.5469234Z 2025-05-07T20:26:01.5469238Z 2025-05-07T20:26:01.5472587Z 2025-05-07T20:26:01.5757968Z libnpp-12.3.3.65 | 130.6 MB | 2 | 2%  2025-05-07T20:26:01.6111786Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:26:01.6112075Z 2025-05-07T20:26:01.6469299Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:26:01.6469581Z 2025-05-07T20:26:01.6469585Z 2025-05-07T20:26:01.6469589Z 2025-05-07T20:26:01.6469592Z 2025-05-07T20:26:01.6469596Z 2025-05-07T20:26:01.7090780Z libnpp-12.3.3.65 | 130.6 MB | 4 | 5%  2025-05-07T20:26:01.7154642Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:01.7154943Z 2025-05-07T20:26:01.7154947Z 2025-05-07T20:26:01.7154951Z 2025-05-07T20:26:01.7474750Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:01.7475232Z 2025-05-07T20:26:01.7475238Z 2025-05-07T20:26:01.7475275Z 2025-05-07T20:26:01.7475281Z 2025-05-07T20:26:01.7475287Z 2025-05-07T20:26:01.7614154Z libnpp-12.3.3.65 | 130.6 MB | 7 | 7%  2025-05-07T20:26:01.7614978Z 2025-05-07T20:26:01.7720064Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:26:01.7720334Z 2025-05-07T20:26:01.7720338Z 2025-05-07T20:26:01.7720342Z 2025-05-07T20:26:01.7720346Z 2025-05-07T20:26:01.7720350Z 2025-05-07T20:26:01.7721137Z 2025-05-07T20:26:01.8477371Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:01.8477683Z 2025-05-07T20:26:01.8477689Z 2025-05-07T20:26:01.8477727Z 2025-05-07T20:26:01.8477732Z 2025-05-07T20:26:01.8483459Z 2025-05-07T20:26:01.8491704Z libnpp-12.3.3.65 | 130.6 MB | 9 | 10%  2025-05-07T20:26:01.8722424Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:01.8722707Z 2025-05-07T20:26:01.8722713Z 2025-05-07T20:26:01.8722718Z 2025-05-07T20:26:01.8722723Z 2025-05-07T20:26:01.8723012Z 2025-05-07T20:26:01.8725471Z 2025-05-07T20:26:01.9086030Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:26:01.9086904Z 2025-05-07T20:26:01.9486733Z nsight-compute-2025. 
| 320.6 MB | #########4 | 95%  2025-05-07T20:26:01.9487091Z 2025-05-07T20:26:01.9487097Z 2025-05-07T20:26:01.9487102Z 2025-05-07T20:26:01.9487107Z 2025-05-07T20:26:01.9489695Z 2025-05-07T20:26:01.9723876Z libnpp-12.3.3.65 | 130.6 MB | #2 | 12%  2025-05-07T20:26:01.9724272Z 2025-05-07T20:26:01.9724277Z 2025-05-07T20:26:01.9724280Z 2025-05-07T20:26:01.9724535Z 2025-05-07T20:26:01.9724539Z 2025-05-07T20:26:01.9726778Z 2025-05-07T20:26:01.9799201Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:26:01.9799596Z 2025-05-07T20:26:01.9801474Z 2025-05-07T20:26:01.9928798Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:02.0462001Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:02.0462365Z 2025-05-07T20:26:02.0488240Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:26:02.0488612Z 2025-05-07T20:26:02.0488618Z 2025-05-07T20:26:02.0488623Z 2025-05-07T20:26:02.0488628Z 2025-05-07T20:26:02.0488633Z 2025-05-07T20:26:02.0488638Z 2025-05-07T20:26:02.0490944Z 2025-05-07T20:26:02.0498569Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:02.0498863Z 2025-05-07T20:26:02.0498867Z 2025-05-07T20:26:02.0498871Z 2025-05-07T20:26:02.0498874Z 2025-05-07T20:26:02.0501004Z 2025-05-07T20:26:02.0724115Z libnpp-12.3.3.65 | 130.6 MB | #4 | 14%  2025-05-07T20:26:02.0724431Z 2025-05-07T20:26:02.0724435Z 2025-05-07T20:26:02.0724439Z 2025-05-07T20:26:02.0724442Z 2025-05-07T20:26:02.0724446Z 2025-05-07T20:26:02.0727959Z 2025-05-07T20:26:02.1244691Z cuda-nsight-12.8.55 | 113.2 MB | 8 | 8%  2025-05-07T20:26:02.1490218Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:26:02.1490496Z 2025-05-07T20:26:02.1490501Z 2025-05-07T20:26:02.1490505Z 2025-05-07T20:26:02.1490508Z 2025-05-07T20:26:02.1490513Z 2025-05-07T20:26:02.1490516Z 2025-05-07T20:26:02.1492500Z 2025-05-07T20:26:02.1761692Z cuda-nvvp-12.8.57 | 112.4 MB | 1 | 1%  2025-05-07T20:26:02.1762009Z 2025-05-07T20:26:02.1762015Z 2025-05-07T20:26:02.1762018Z 2025-05-07T20:26:02.1762022Z 2025-05-07T20:26:02.1762026Z 2025-05-07T20:26:02.1762031Z 2025-05-07T20:26:02.1781442Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:02.1781862Z 2025-05-07T20:26:02.1781866Z 2025-05-07T20:26:02.1781869Z 2025-05-07T20:26:02.1781873Z 2025-05-07T20:26:02.1783936Z 2025-05-07T20:26:02.1832131Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:02.1832422Z 2025-05-07T20:26:02.2701022Z nsight-compute-2025. | 320.6 MB | #########6 | 97%  2025-05-07T20:26:02.2762283Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:02.2762582Z 2025-05-07T20:26:02.2762754Z 2025-05-07T20:26:02.2762759Z 2025-05-07T20:26:02.2762765Z 2025-05-07T20:26:02.2762770Z 2025-05-07T20:26:02.2762789Z 2025-05-07T20:26:02.2807229Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:02.2807631Z 2025-05-07T20:26:02.2807638Z 2025-05-07T20:26:02.2807644Z 2025-05-07T20:26:02.2807649Z 2025-05-07T20:26:02.2807654Z 2025-05-07T20:26:02.2807660Z 2025-05-07T20:26:02.2808111Z 2025-05-07T20:26:02.2878657Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 3%  2025-05-07T20:26:02.2879101Z 2025-05-07T20:26:02.2879107Z 2025-05-07T20:26:02.2879112Z 2025-05-07T20:26:02.2879117Z 2025-05-07T20:26:02.2886473Z 2025-05-07T20:26:02.3105224Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:26:02.3107274Z 2025-05-07T20:26:02.3810075Z nsight-compute-2025. 
| 320.6 MB | #########8 | 98%  2025-05-07T20:26:02.3810809Z 2025-05-07T20:26:02.3810818Z 2025-05-07T20:26:02.3810823Z 2025-05-07T20:26:02.3810829Z 2025-05-07T20:26:02.3810834Z 2025-05-07T20:26:02.3810839Z 2025-05-07T20:26:02.3811098Z 2025-05-07T20:26:02.3821088Z cuda-nvvp-12.8.57 | 112.4 MB | 5 | 5%  2025-05-07T20:26:02.3821515Z 2025-05-07T20:26:02.3821521Z 2025-05-07T20:26:02.3821527Z 2025-05-07T20:26:02.3821532Z 2025-05-07T20:26:02.3821538Z 2025-05-07T20:26:02.3824646Z 2025-05-07T20:26:02.3925938Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 15%  2025-05-07T20:26:02.3934745Z 2025-05-07T20:26:02.3934753Z 2025-05-07T20:26:02.3935036Z 2025-05-07T20:26:02.3935042Z 2025-05-07T20:26:02.3935047Z 2025-05-07T20:26:02.4184246Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:26:02.4354239Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:26:02.4354573Z 2025-05-07T20:26:02.4813955Z nsight-compute-2025. | 320.6 MB | #########8 | 99%  2025-05-07T20:26:02.4814403Z 2025-05-07T20:26:02.4814410Z 2025-05-07T20:26:02.4814415Z 2025-05-07T20:26:02.4814420Z 2025-05-07T20:26:02.4814426Z 2025-05-07T20:26:02.4814431Z 2025-05-07T20:26:02.4815640Z 2025-05-07T20:26:02.4927534Z cuda-nvvp-12.8.57 | 112.4 MB | 7 | 8%  2025-05-07T20:26:02.4927830Z 2025-05-07T20:26:02.4927834Z 2025-05-07T20:26:02.4927837Z 2025-05-07T20:26:02.4927841Z 2025-05-07T20:26:02.4930796Z 2025-05-07T20:26:02.4935509Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 23%  2025-05-07T20:26:02.4935843Z 2025-05-07T20:26:02.4935847Z 2025-05-07T20:26:02.4935875Z 2025-05-07T20:26:02.4935878Z 2025-05-07T20:26:02.4935882Z 2025-05-07T20:26:02.4935885Z 2025-05-07T20:26:02.5227029Z cuda-nsight-12.8.55 | 113.2 MB | #7 | 17%  2025-05-07T20:26:02.5578212Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 66% 2025-05-07T20:26:02.5581684Z 2025-05-07T20:26:02.5814129Z nsight-compute-2025. 
| 320.6 MB | #########9 | 100%  2025-05-07T20:26:02.5814428Z 2025-05-07T20:26:02.5814432Z 2025-05-07T20:26:02.5814436Z 2025-05-07T20:26:02.5814440Z 2025-05-07T20:26:02.5814443Z 2025-05-07T20:26:02.5814447Z 2025-05-07T20:26:02.5815807Z 2025-05-07T20:26:02.5938152Z cuda-nvvp-12.8.57 | 112.4 MB | 9 | 10%  2025-05-07T20:26:02.5938470Z 2025-05-07T20:26:02.5938474Z 2025-05-07T20:26:02.5938478Z 2025-05-07T20:26:02.5938481Z 2025-05-07T20:26:02.5938485Z 2025-05-07T20:26:02.5938489Z 2025-05-07T20:26:02.5958661Z cuda-nsight-12.8.55 | 113.2 MB | #9 | 20%  2025-05-07T20:26:02.5959067Z 2025-05-07T20:26:02.5959071Z 2025-05-07T20:26:02.5959075Z 2025-05-07T20:26:02.5959079Z 2025-05-07T20:26:02.5961412Z 2025-05-07T20:26:02.6284059Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 25%  2025-05-07T20:26:02.6815365Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:02.6815753Z 2025-05-07T20:26:02.6815758Z 2025-05-07T20:26:02.6815795Z 2025-05-07T20:26:02.6815800Z 2025-05-07T20:26:02.6815803Z 2025-05-07T20:26:02.6815807Z 2025-05-07T20:26:02.6817053Z 2025-05-07T20:26:02.6940247Z cuda-nvvp-12.8.57 | 112.4 MB | #1 | 12%  2025-05-07T20:26:02.6940679Z 2025-05-07T20:26:02.6940686Z 2025-05-07T20:26:02.6940691Z 2025-05-07T20:26:02.6940696Z 2025-05-07T20:26:02.6940701Z 2025-05-07T20:26:02.6943511Z 2025-05-07T20:26:02.6960507Z cuda-nsight-12.8.55 | 113.2 MB | ##2 | 22%  2025-05-07T20:26:02.6960854Z 2025-05-07T20:26:02.6960858Z 2025-05-07T20:26:02.6960862Z 2025-05-07T20:26:02.6960866Z 2025-05-07T20:26:02.6964109Z 2025-05-07T20:26:02.7287278Z libnpp-12.3.3.65 | 130.6 MB | ##7 | 27%  2025-05-07T20:26:02.7825717Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:26:02.7826113Z 2025-05-07T20:26:02.7826121Z 2025-05-07T20:26:02.7826126Z 2025-05-07T20:26:02.7826131Z 2025-05-07T20:26:02.7826136Z 2025-05-07T20:26:02.7826141Z 2025-05-07T20:26:02.7832602Z 2025-05-07T20:26:02.7965786Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 14%  2025-05-07T20:26:02.7966089Z 2025-05-07T20:26:02.7966094Z 2025-05-07T20:26:02.7966097Z 2025-05-07T20:26:02.7966101Z 2025-05-07T20:26:02.7966376Z 2025-05-07T20:26:02.8025785Z libnpp-12.3.3.65 | 130.6 MB | ##9 | 30%  2025-05-07T20:26:02.8026080Z 2025-05-07T20:26:02.8026086Z 2025-05-07T20:26:02.8026091Z 2025-05-07T20:26:02.8026096Z 2025-05-07T20:26:02.8026102Z 2025-05-07T20:26:02.8028028Z 2025-05-07T20:26:02.8407444Z cuda-nsight-12.8.55 | 113.2 MB | ##4 | 24%  2025-05-07T20:26:02.8829140Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 67% 2025-05-07T20:26:02.8829416Z 2025-05-07T20:26:02.8829420Z 2025-05-07T20:26:02.8829424Z 2025-05-07T20:26:02.8829427Z 2025-05-07T20:26:02.8829432Z 2025-05-07T20:26:02.8829436Z 2025-05-07T20:26:02.8829443Z 2025-05-07T20:26:02.9025419Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:26:02.9025778Z 2025-05-07T20:26:02.9025782Z 2025-05-07T20:26:02.9025786Z 2025-05-07T20:26:02.9025789Z 2025-05-07T20:26:02.9025793Z 2025-05-07T20:26:02.9037223Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 32%  2025-05-07T20:26:02.9037502Z 2025-05-07T20:26:02.9037506Z 2025-05-07T20:26:02.9037510Z 2025-05-07T20:26:02.9037514Z 2025-05-07T20:26:02.9037518Z 2025-05-07T20:26:02.9037521Z 2025-05-07T20:26:02.9535822Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 27%  2025-05-07T20:26:02.9830899Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:02.9831303Z 2025-05-07T20:26:02.9831309Z 2025-05-07T20:26:02.9831314Z 2025-05-07T20:26:02.9831320Z 2025-05-07T20:26:02.9831325Z 2025-05-07T20:26:02.9831330Z 2025-05-07T20:26:02.9831335Z 2025-05-07T20:26:03.0058202Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  
2025-05-07T20:26:03.0058514Z 2025-05-07T20:26:03.0058519Z 2025-05-07T20:26:03.0058545Z 2025-05-07T20:26:03.0058549Z 2025-05-07T20:26:03.0058552Z 2025-05-07T20:26:03.0062319Z 2025-05-07T20:26:03.0122607Z cuda-nsight-12.8.55 | 113.2 MB | ##8 | 29%  2025-05-07T20:26:03.0122974Z 2025-05-07T20:26:03.0122978Z 2025-05-07T20:26:03.0122982Z 2025-05-07T20:26:03.0122986Z 2025-05-07T20:26:03.0122990Z 2025-05-07T20:26:03.0546216Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:03.0883834Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 68% 2025-05-07T20:26:03.0884208Z 2025-05-07T20:26:03.0884212Z 2025-05-07T20:26:03.0884247Z 2025-05-07T20:26:03.0884251Z 2025-05-07T20:26:03.0884254Z 2025-05-07T20:26:03.0884258Z 2025-05-07T20:26:03.0885742Z 2025-05-07T20:26:03.1113330Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 20%  2025-05-07T20:26:03.1113688Z 2025-05-07T20:26:03.1113692Z 2025-05-07T20:26:03.1113696Z 2025-05-07T20:26:03.1113699Z 2025-05-07T20:26:03.1113703Z 2025-05-07T20:26:03.1115865Z 2025-05-07T20:26:03.1231051Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 31%  2025-05-07T20:26:03.1231475Z 2025-05-07T20:26:03.1231480Z 2025-05-07T20:26:03.1231485Z 2025-05-07T20:26:03.1231500Z 2025-05-07T20:26:03.1231505Z 2025-05-07T20:26:03.1582363Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:03.1896851Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:03.1897226Z 2025-05-07T20:26:03.1897232Z 2025-05-07T20:26:03.1897236Z 2025-05-07T20:26:03.1897242Z 2025-05-07T20:26:03.1897247Z 2025-05-07T20:26:03.1897252Z 2025-05-07T20:26:03.1899063Z 2025-05-07T20:26:03.2236703Z cuda-nvvp-12.8.57 | 112.4 MB | ##2 | 22%  2025-05-07T20:26:03.2237121Z 2025-05-07T20:26:03.2237127Z 2025-05-07T20:26:03.2237132Z 2025-05-07T20:26:03.2237137Z 2025-05-07T20:26:03.2237142Z 2025-05-07T20:26:03.2237147Z 2025-05-07T20:26:03.2414066Z cuda-nsight-12.8.55 | 113.2 MB | ###3 | 33%  2025-05-07T20:26:03.2414487Z 2025-05-07T20:26:03.2414492Z 2025-05-07T20:26:03.2414497Z 2025-05-07T20:26:03.2414502Z 2025-05-07T20:26:03.2417075Z 2025-05-07T20:26:03.2599610Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:03.2914318Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:26:03.2914683Z 2025-05-07T20:26:03.2914689Z 2025-05-07T20:26:03.2914694Z 2025-05-07T20:26:03.2914699Z 2025-05-07T20:26:03.2914715Z 2025-05-07T20:26:03.2914720Z 2025-05-07T20:26:03.2917542Z 2025-05-07T20:26:03.3239350Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 25%  2025-05-07T20:26:03.3240059Z 2025-05-07T20:26:03.3240072Z 2025-05-07T20:26:03.3240076Z 2025-05-07T20:26:03.3240079Z 2025-05-07T20:26:03.3240083Z 2025-05-07T20:26:03.3240086Z 2025-05-07T20:26:03.3523528Z cuda-nsight-12.8.55 | 113.2 MB | ###5 | 36%  2025-05-07T20:26:03.3523947Z 2025-05-07T20:26:03.3523953Z 2025-05-07T20:26:03.3523983Z 2025-05-07T20:26:03.3523989Z 2025-05-07T20:26:03.3527220Z 2025-05-07T20:26:03.3607016Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:26:03.3916943Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:26:03.3917288Z 2025-05-07T20:26:03.3917294Z 2025-05-07T20:26:03.3917299Z 2025-05-07T20:26:03.3917304Z 2025-05-07T20:26:03.3917309Z 2025-05-07T20:26:03.3917314Z 2025-05-07T20:26:03.3917331Z 2025-05-07T20:26:03.4374155Z cuda-nvvp-12.8.57 | 112.4 MB | ##6 | 27%  2025-05-07T20:26:03.4374561Z 2025-05-07T20:26:03.4374566Z 2025-05-07T20:26:03.4374601Z 2025-05-07T20:26:03.4374616Z 2025-05-07T20:26:03.4374622Z 2025-05-07T20:26:03.4378423Z 2025-05-07T20:26:03.4589121Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 38%  2025-05-07T20:26:03.4589530Z 
2025-05-07T20:26:03.4589545Z 2025-05-07T20:26:03.4589550Z 2025-05-07T20:26:03.4589556Z 2025-05-07T20:26:03.4591428Z 2025-05-07T20:26:03.4669126Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:26:03.4921618Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:03.4921976Z 2025-05-07T20:26:03.4921982Z 2025-05-07T20:26:03.4921987Z 2025-05-07T20:26:03.4921992Z 2025-05-07T20:26:03.4921998Z 2025-05-07T20:26:03.4922003Z 2025-05-07T20:26:03.4922017Z 2025-05-07T20:26:03.5423736Z cuda-nvvp-12.8.57 | 112.4 MB | ##9 | 29%  2025-05-07T20:26:03.5424136Z 2025-05-07T20:26:03.5424142Z 2025-05-07T20:26:03.5424147Z 2025-05-07T20:26:03.5424152Z 2025-05-07T20:26:03.5424172Z 2025-05-07T20:26:03.5424219Z 2025-05-07T20:26:03.5606104Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:26:03.5606503Z 2025-05-07T20:26:03.5606508Z 2025-05-07T20:26:03.5606523Z 2025-05-07T20:26:03.5606528Z 2025-05-07T20:26:03.5613341Z 2025-05-07T20:26:03.5669691Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:26:03.5923238Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:26:03.5923590Z 2025-05-07T20:26:03.5923596Z 2025-05-07T20:26:03.5923601Z 2025-05-07T20:26:03.5923606Z 2025-05-07T20:26:03.5923611Z 2025-05-07T20:26:03.5923617Z 2025-05-07T20:26:03.5924386Z 2025-05-07T20:26:03.6424251Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 31%  2025-05-07T20:26:03.6424666Z 2025-05-07T20:26:03.6424670Z 2025-05-07T20:26:03.6424674Z 2025-05-07T20:26:03.6424677Z 2025-05-07T20:26:03.6424681Z 2025-05-07T20:26:03.6429013Z 2025-05-07T20:26:03.6609603Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 42%  2025-05-07T20:26:03.6609939Z 2025-05-07T20:26:03.6609943Z 2025-05-07T20:26:03.6609947Z 2025-05-07T20:26:03.6609951Z 2025-05-07T20:26:03.6609954Z 2025-05-07T20:26:03.6753828Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:26:03.6923828Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:26:03.6924181Z 2025-05-07T20:26:03.6924432Z 2025-05-07T20:26:03.6924437Z 2025-05-07T20:26:03.6924441Z 2025-05-07T20:26:03.6924445Z 2025-05-07T20:26:03.6924449Z 2025-05-07T20:26:03.6924453Z 2025-05-07T20:26:03.7434007Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 34%  2025-05-07T20:26:03.7434352Z 2025-05-07T20:26:03.7434356Z 2025-05-07T20:26:03.7434359Z 2025-05-07T20:26:03.7434363Z 2025-05-07T20:26:03.7434366Z 2025-05-07T20:26:03.7437132Z 2025-05-07T20:26:03.7613878Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:26:03.7614292Z 2025-05-07T20:26:03.7614298Z 2025-05-07T20:26:03.7614625Z 2025-05-07T20:26:03.7614630Z 2025-05-07T20:26:03.7616220Z 2025-05-07T20:26:03.7757529Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:26:03.7926619Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:03.7926947Z 2025-05-07T20:26:03.7926951Z 2025-05-07T20:26:03.7926954Z 2025-05-07T20:26:03.7926958Z 2025-05-07T20:26:03.7926985Z 2025-05-07T20:26:03.7926989Z 2025-05-07T20:26:03.7932552Z 2025-05-07T20:26:03.8434197Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 36%  2025-05-07T20:26:03.8434505Z 2025-05-07T20:26:03.8434509Z 2025-05-07T20:26:03.8434513Z 2025-05-07T20:26:03.8434517Z 2025-05-07T20:26:03.8434521Z 2025-05-07T20:26:03.8443201Z 2025-05-07T20:26:03.8615929Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:26:03.8616249Z 2025-05-07T20:26:03.8616253Z 2025-05-07T20:26:03.8616256Z 2025-05-07T20:26:03.8616260Z 2025-05-07T20:26:03.8625711Z 2025-05-07T20:26:03.8771102Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:03.8932161Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 
2025-05-07T20:26:03.8932449Z 2025-05-07T20:26:03.8932453Z 2025-05-07T20:26:03.8932457Z 2025-05-07T20:26:03.8932461Z 2025-05-07T20:26:03.8932464Z 2025-05-07T20:26:03.8932469Z 2025-05-07T20:26:03.8937188Z 2025-05-07T20:26:03.9435140Z cuda-nvvp-12.8.57 | 112.4 MB | ###8 | 39%  2025-05-07T20:26:03.9435452Z 2025-05-07T20:26:03.9435457Z 2025-05-07T20:26:03.9435461Z 2025-05-07T20:26:03.9435465Z 2025-05-07T20:26:03.9435468Z 2025-05-07T20:26:03.9435472Z 2025-05-07T20:26:03.9616147Z cuda-nsight-12.8.55 | 113.2 MB | ####9 | 49%  2025-05-07T20:26:03.9616443Z 2025-05-07T20:26:03.9616447Z 2025-05-07T20:26:03.9616451Z 2025-05-07T20:26:03.9616455Z 2025-05-07T20:26:03.9618103Z 2025-05-07T20:26:03.9774828Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:26:03.9935718Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:03.9936002Z 2025-05-07T20:26:03.9936006Z 2025-05-07T20:26:03.9936010Z 2025-05-07T20:26:03.9936014Z 2025-05-07T20:26:03.9936018Z 2025-05-07T20:26:03.9936022Z 2025-05-07T20:26:03.9936231Z 2025-05-07T20:26:04.0461056Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:26:04.0461397Z 2025-05-07T20:26:04.0461402Z 2025-05-07T20:26:04.0461406Z 2025-05-07T20:26:04.0461410Z 2025-05-07T20:26:04.0461413Z 2025-05-07T20:26:04.0461417Z 2025-05-07T20:26:04.0657578Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 52%  2025-05-07T20:26:04.0657901Z 2025-05-07T20:26:04.0657905Z 2025-05-07T20:26:04.0657908Z 2025-05-07T20:26:04.0657912Z 2025-05-07T20:26:04.0659483Z 2025-05-07T20:26:04.0776491Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:26:04.0936383Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:04.0936745Z 2025-05-07T20:26:04.0936783Z 2025-05-07T20:26:04.0936786Z 2025-05-07T20:26:04.0936790Z 2025-05-07T20:26:04.0936794Z 2025-05-07T20:26:04.0936798Z 2025-05-07T20:26:04.0939350Z 2025-05-07T20:26:04.1463714Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:04.1464022Z 2025-05-07T20:26:04.1464027Z 2025-05-07T20:26:04.1464030Z 2025-05-07T20:26:04.1464034Z 2025-05-07T20:26:04.1464288Z 2025-05-07T20:26:04.1464294Z 2025-05-07T20:26:04.1666589Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:26:04.1666892Z 2025-05-07T20:26:04.1666896Z 2025-05-07T20:26:04.1666900Z 2025-05-07T20:26:04.1666904Z 2025-05-07T20:26:04.1675400Z 2025-05-07T20:26:04.1800472Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:26:04.2077657Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:04.2078024Z 2025-05-07T20:26:04.2078030Z 2025-05-07T20:26:04.2078035Z 2025-05-07T20:26:04.2078040Z 2025-05-07T20:26:04.2078334Z 2025-05-07T20:26:04.2078339Z 2025-05-07T20:26:04.2078344Z 2025-05-07T20:26:04.2465988Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 46%  2025-05-07T20:26:04.2466401Z 2025-05-07T20:26:04.2466407Z 2025-05-07T20:26:04.2466412Z 2025-05-07T20:26:04.2466418Z 2025-05-07T20:26:04.2466423Z 2025-05-07T20:26:04.2466428Z 2025-05-07T20:26:04.2668098Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 57%  2025-05-07T20:26:04.2668495Z 2025-05-07T20:26:04.2668501Z 2025-05-07T20:26:04.2668517Z 2025-05-07T20:26:04.2668523Z 2025-05-07T20:26:04.2670835Z 2025-05-07T20:26:04.2801317Z libnpp-12.3.3.65 | 130.6 MB | #####9 | 59%  2025-05-07T20:26:04.3083540Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:04.3083910Z 2025-05-07T20:26:04.3083916Z 2025-05-07T20:26:04.3083921Z 2025-05-07T20:26:04.3083926Z 2025-05-07T20:26:04.3083931Z 2025-05-07T20:26:04.3083936Z 2025-05-07T20:26:04.3086099Z 2025-05-07T20:26:04.3484490Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 
49%  2025-05-07T20:26:04.3484895Z 2025-05-07T20:26:04.3484901Z 2025-05-07T20:26:04.3484906Z 2025-05-07T20:26:04.3484912Z 2025-05-07T20:26:04.3484926Z 2025-05-07T20:26:04.3490430Z 2025-05-07T20:26:04.3673702Z cuda-nsight-12.8.55 | 113.2 MB | #####9 | 59%  2025-05-07T20:26:04.3674105Z 2025-05-07T20:26:04.3674142Z 2025-05-07T20:26:04.3674161Z 2025-05-07T20:26:04.3674166Z 2025-05-07T20:26:04.3675386Z 2025-05-07T20:26:04.3876917Z libnpp-12.3.3.65 | 130.6 MB | ######1 | 61%  2025-05-07T20:26:04.4085199Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:04.4085547Z 2025-05-07T20:26:04.4085553Z 2025-05-07T20:26:04.4085558Z 2025-05-07T20:26:04.4085563Z 2025-05-07T20:26:04.4085568Z 2025-05-07T20:26:04.4085573Z 2025-05-07T20:26:04.4087645Z 2025-05-07T20:26:04.4550824Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 51%  2025-05-07T20:26:04.4551180Z 2025-05-07T20:26:04.4551184Z 2025-05-07T20:26:04.4551188Z 2025-05-07T20:26:04.4551192Z 2025-05-07T20:26:04.4551195Z 2025-05-07T20:26:04.4551212Z 2025-05-07T20:26:04.4676624Z cuda-nsight-12.8.55 | 113.2 MB | ######1 | 61%  2025-05-07T20:26:04.4676984Z 2025-05-07T20:26:04.4676989Z 2025-05-07T20:26:04.4676992Z 2025-05-07T20:26:04.4677004Z 2025-05-07T20:26:04.4677028Z 2025-05-07T20:26:04.4879586Z libnpp-12.3.3.65 | 130.6 MB | ######3 | 63%  2025-05-07T20:26:04.5086316Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:04.5086570Z 2025-05-07T20:26:04.5086574Z 2025-05-07T20:26:04.5086578Z 2025-05-07T20:26:04.5086581Z 2025-05-07T20:26:04.5086585Z 2025-05-07T20:26:04.5086589Z 2025-05-07T20:26:04.5088027Z 2025-05-07T20:26:04.5593856Z cuda-nvvp-12.8.57 | 112.4 MB | #####3 | 54%  2025-05-07T20:26:04.5594154Z 2025-05-07T20:26:04.5594158Z 2025-05-07T20:26:04.5594162Z 2025-05-07T20:26:04.5594196Z 2025-05-07T20:26:04.5594200Z 2025-05-07T20:26:04.5594212Z 2025-05-07T20:26:04.5705631Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 64%  2025-05-07T20:26:04.5706041Z 2025-05-07T20:26:04.5706046Z 2025-05-07T20:26:04.5706052Z 2025-05-07T20:26:04.5706057Z 2025-05-07T20:26:04.5708251Z 2025-05-07T20:26:04.5973942Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 66%  2025-05-07T20:26:04.6112600Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:04.6112939Z 2025-05-07T20:26:04.6112944Z 2025-05-07T20:26:04.6112949Z 2025-05-07T20:26:04.6112954Z 2025-05-07T20:26:04.6112960Z 2025-05-07T20:26:04.6112967Z 2025-05-07T20:26:04.6114302Z 2025-05-07T20:26:04.6596289Z cuda-nvvp-12.8.57 | 112.4 MB | #####6 | 56%  2025-05-07T20:26:04.6596686Z 2025-05-07T20:26:04.6596690Z 2025-05-07T20:26:04.6596693Z 2025-05-07T20:26:04.6596697Z 2025-05-07T20:26:04.6596700Z 2025-05-07T20:26:04.6599152Z 2025-05-07T20:26:04.6726272Z cuda-nsight-12.8.55 | 113.2 MB | ######6 | 66%  2025-05-07T20:26:04.6726866Z 2025-05-07T20:26:04.6726871Z 2025-05-07T20:26:04.6726874Z 2025-05-07T20:26:04.6726878Z 2025-05-07T20:26:04.6729533Z 2025-05-07T20:26:04.6992857Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 68%  2025-05-07T20:26:04.7429801Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:04.7430075Z 2025-05-07T20:26:04.7430079Z 2025-05-07T20:26:04.7430083Z 2025-05-07T20:26:04.7430086Z 2025-05-07T20:26:04.7430090Z 2025-05-07T20:26:04.7430094Z 2025-05-07T20:26:04.7431226Z 2025-05-07T20:26:04.7596415Z cuda-nvvp-12.8.57 | 112.4 MB | #####8 | 59%  2025-05-07T20:26:04.7596823Z 2025-05-07T20:26:04.7596829Z 2025-05-07T20:26:04.7596834Z 2025-05-07T20:26:04.7596838Z 2025-05-07T20:26:04.7596842Z 2025-05-07T20:26:04.7599448Z 2025-05-07T20:26:04.7811767Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 68%  
2025-05-07T20:26:04.7812073Z 2025-05-07T20:26:04.7812109Z 2025-05-07T20:26:04.7812113Z 2025-05-07T20:26:04.7812117Z 2025-05-07T20:26:04.7814426Z 2025-05-07T20:26:04.7993541Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 70%  2025-05-07T20:26:04.8483629Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:04.8483909Z 2025-05-07T20:26:04.8483913Z 2025-05-07T20:26:04.8483916Z 2025-05-07T20:26:04.8483950Z 2025-05-07T20:26:04.8483954Z 2025-05-07T20:26:04.8483958Z 2025-05-07T20:26:04.8483970Z 2025-05-07T20:26:04.8606107Z cuda-nvvp-12.8.57 | 112.4 MB | ######1 | 61%  2025-05-07T20:26:04.8606403Z 2025-05-07T20:26:04.8606407Z 2025-05-07T20:26:04.8606410Z 2025-05-07T20:26:04.8606414Z 2025-05-07T20:26:04.8606426Z 2025-05-07T20:26:04.8608544Z 2025-05-07T20:26:04.8997155Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 71%  2025-05-07T20:26:04.9486893Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:26:04.9487242Z 2025-05-07T20:26:04.9487275Z 2025-05-07T20:26:04.9487278Z 2025-05-07T20:26:04.9487282Z 2025-05-07T20:26:04.9487285Z 2025-05-07T20:26:04.9487289Z 2025-05-07T20:26:04.9487293Z 2025-05-07T20:26:04.9612245Z cuda-nvvp-12.8.57 | 112.4 MB | ######3 | 63%  2025-05-07T20:26:04.9612645Z 2025-05-07T20:26:04.9612649Z 2025-05-07T20:26:04.9612652Z 2025-05-07T20:26:04.9612656Z 2025-05-07T20:26:04.9612678Z 2025-05-07T20:26:04.9612691Z 2025-05-07T20:26:05.0002653Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 73%  2025-05-07T20:26:05.0071490Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 80% 2025-05-07T20:26:05.0071830Z 2025-05-07T20:26:05.0071834Z 2025-05-07T20:26:05.0071838Z 2025-05-07T20:26:05.0071842Z 2025-05-07T20:26:05.0074015Z 2025-05-07T20:26:05.0488303Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:05.0488616Z 2025-05-07T20:26:05.0488620Z 2025-05-07T20:26:05.0488624Z 2025-05-07T20:26:05.0488627Z 2025-05-07T20:26:05.0488663Z 2025-05-07T20:26:05.0488666Z 2025-05-07T20:26:05.0489322Z 2025-05-07T20:26:05.0615196Z cuda-nvvp-12.8.57 | 112.4 MB | ######5 | 66%  2025-05-07T20:26:05.0615628Z 2025-05-07T20:26:05.0615634Z 2025-05-07T20:26:05.0615639Z 2025-05-07T20:26:05.0615644Z 2025-05-07T20:26:05.0615649Z 2025-05-07T20:26:05.0615654Z 2025-05-07T20:26:05.1069222Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:05.1079756Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:05.1080137Z 2025-05-07T20:26:05.1080144Z 2025-05-07T20:26:05.1080150Z 2025-05-07T20:26:05.1080155Z 2025-05-07T20:26:05.1080160Z 2025-05-07T20:26:05.1596962Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 74%  2025-05-07T20:26:05.1597371Z 2025-05-07T20:26:05.1597376Z 2025-05-07T20:26:05.1597381Z 2025-05-07T20:26:05.1597386Z 2025-05-07T20:26:05.1597391Z 2025-05-07T20:26:05.1597396Z 2025-05-07T20:26:05.1598085Z 2025-05-07T20:26:05.1616977Z cuda-nvvp-12.8.57 | 112.4 MB | ######8 | 68%  2025-05-07T20:26:05.1617693Z 2025-05-07T20:26:05.1617701Z 2025-05-07T20:26:05.1617706Z 2025-05-07T20:26:05.1617711Z 2025-05-07T20:26:05.1617715Z 2025-05-07T20:26:05.1617722Z 2025-05-07T20:26:05.2084605Z cuda-nsight-12.8.55 | 113.2 MB | #######8 | 78%  2025-05-07T20:26:05.2085027Z 2025-05-07T20:26:05.2085058Z 2025-05-07T20:26:05.2085062Z 2025-05-07T20:26:05.2085066Z 2025-05-07T20:26:05.2088061Z 2025-05-07T20:26:05.2254518Z libnpp-12.3.3.65 | 130.6 MB | #######5 | 76%  2025-05-07T20:26:05.2601881Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:05.2602186Z 2025-05-07T20:26:05.2602191Z 2025-05-07T20:26:05.2602195Z 2025-05-07T20:26:05.2602200Z 2025-05-07T20:26:05.2602203Z 
2025-05-07T20:26:05.2602208Z 2025-05-07T20:26:05.2602212Z 2025-05-07T20:26:05.2701891Z cuda-nvvp-12.8.57 | 112.4 MB | ####### | 70%  2025-05-07T20:26:05.2702206Z 2025-05-07T20:26:05.2702211Z 2025-05-07T20:26:05.2702214Z 2025-05-07T20:26:05.2702218Z 2025-05-07T20:26:05.2702222Z 2025-05-07T20:26:05.2705616Z 2025-05-07T20:26:05.3095816Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 81%  2025-05-07T20:26:05.3096127Z 2025-05-07T20:26:05.3096131Z 2025-05-07T20:26:05.3096135Z 2025-05-07T20:26:05.3096138Z 2025-05-07T20:26:05.3100320Z 2025-05-07T20:26:05.3263234Z libnpp-12.3.3.65 | 130.6 MB | #######7 | 78%  2025-05-07T20:26:05.3603774Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:05.3604088Z 2025-05-07T20:26:05.3604094Z 2025-05-07T20:26:05.3604099Z 2025-05-07T20:26:05.3604104Z 2025-05-07T20:26:05.3604110Z 2025-05-07T20:26:05.3604115Z 2025-05-07T20:26:05.3609957Z 2025-05-07T20:26:05.3744303Z cuda-nvvp-12.8.57 | 112.4 MB | #######2 | 73%  2025-05-07T20:26:05.3744747Z 2025-05-07T20:26:05.3744753Z 2025-05-07T20:26:05.3744758Z 2025-05-07T20:26:05.3744764Z 2025-05-07T20:26:05.3744804Z 2025-05-07T20:26:05.3744810Z 2025-05-07T20:26:05.4098539Z cuda-nsight-12.8.55 | 113.2 MB | ########3 | 83%  2025-05-07T20:26:05.4098978Z 2025-05-07T20:26:05.4098984Z 2025-05-07T20:26:05.4098990Z 2025-05-07T20:26:05.4098995Z 2025-05-07T20:26:05.4099738Z 2025-05-07T20:26:05.4345240Z libnpp-12.3.3.65 | 130.6 MB | #######9 | 80%  2025-05-07T20:26:05.4704515Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:05.4704894Z 2025-05-07T20:26:05.4704900Z 2025-05-07T20:26:05.4704905Z 2025-05-07T20:26:05.4704910Z 2025-05-07T20:26:05.4704915Z 2025-05-07T20:26:05.4704921Z 2025-05-07T20:26:05.4708444Z 2025-05-07T20:26:05.4752039Z cuda-nvvp-12.8.57 | 112.4 MB | #######5 | 75%  2025-05-07T20:26:05.4752328Z 2025-05-07T20:26:05.4752332Z 2025-05-07T20:26:05.4752335Z 2025-05-07T20:26:05.4752339Z 2025-05-07T20:26:05.4752343Z 2025-05-07T20:26:05.4752346Z 2025-05-07T20:26:05.5110597Z cuda-nsight-12.8.55 | 113.2 MB | ########5 | 86%  2025-05-07T20:26:05.5111232Z 2025-05-07T20:26:05.5111236Z 2025-05-07T20:26:05.5111239Z 2025-05-07T20:26:05.5111243Z 2025-05-07T20:26:05.5111732Z 2025-05-07T20:26:05.5491466Z libnpp-12.3.3.65 | 130.6 MB | ########1 | 81%  2025-05-07T20:26:05.5727567Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 83% 2025-05-07T20:26:05.5728555Z 2025-05-07T20:26:05.5728567Z 2025-05-07T20:26:05.5728570Z 2025-05-07T20:26:05.5728574Z 2025-05-07T20:26:05.5728578Z 2025-05-07T20:26:05.5728582Z 2025-05-07T20:26:05.5728586Z 2025-05-07T20:26:05.5780270Z cuda-nvvp-12.8.57 | 112.4 MB | #######7 | 78%  2025-05-07T20:26:05.5780607Z 2025-05-07T20:26:05.5780611Z 2025-05-07T20:26:05.5780614Z 2025-05-07T20:26:05.5780618Z 2025-05-07T20:26:05.5780621Z 2025-05-07T20:26:05.5781700Z 2025-05-07T20:26:05.6114695Z cuda-nsight-12.8.55 | 113.2 MB | ########7 | 88%  2025-05-07T20:26:05.6115111Z 2025-05-07T20:26:05.6115425Z 2025-05-07T20:26:05.6115430Z 2025-05-07T20:26:05.6115435Z 2025-05-07T20:26:05.6117151Z 2025-05-07T20:26:05.6599334Z libnpp-12.3.3.65 | 130.6 MB | ########3 | 83%  2025-05-07T20:26:05.6727515Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:05.6727786Z 2025-05-07T20:26:05.6727790Z 2025-05-07T20:26:05.6727794Z 2025-05-07T20:26:05.6727827Z 2025-05-07T20:26:05.6727831Z 2025-05-07T20:26:05.6727835Z 2025-05-07T20:26:05.6727839Z 2025-05-07T20:26:05.6841345Z cuda-nvvp-12.8.57 | 112.4 MB | ######## | 80%  2025-05-07T20:26:05.6841639Z 2025-05-07T20:26:05.6841643Z 2025-05-07T20:26:05.6841647Z 
2025-05-07T20:26:05.6841651Z 2025-05-07T20:26:05.6841655Z 2025-05-07T20:26:05.6841658Z 2025-05-07T20:26:05.7120822Z cuda-nsight-12.8.55 | 113.2 MB | ######### | 90%  2025-05-07T20:26:05.7121144Z 2025-05-07T20:26:05.7121148Z 2025-05-07T20:26:05.7121151Z 2025-05-07T20:26:05.7121155Z 2025-05-07T20:26:05.7122813Z 2025-05-07T20:26:05.7602870Z libnpp-12.3.3.65 | 130.6 MB | ########5 | 85%  2025-05-07T20:26:05.7762480Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:05.7762846Z 2025-05-07T20:26:05.7762852Z 2025-05-07T20:26:05.7762857Z 2025-05-07T20:26:05.7762862Z 2025-05-07T20:26:05.7762868Z 2025-05-07T20:26:05.7762874Z 2025-05-07T20:26:05.7762899Z 2025-05-07T20:26:05.7952999Z cuda-nvvp-12.8.57 | 112.4 MB | ########2 | 83%  2025-05-07T20:26:05.7953301Z 2025-05-07T20:26:05.7953305Z 2025-05-07T20:26:05.7953309Z 2025-05-07T20:26:05.7953312Z 2025-05-07T20:26:05.7953316Z 2025-05-07T20:26:05.7953320Z 2025-05-07T20:26:05.8126644Z cuda-nsight-12.8.55 | 113.2 MB | #########2 | 93%  2025-05-07T20:26:05.8126977Z 2025-05-07T20:26:05.8126981Z 2025-05-07T20:26:05.8126984Z 2025-05-07T20:26:05.8126988Z 2025-05-07T20:26:05.8131121Z 2025-05-07T20:26:05.8721667Z libnpp-12.3.3.65 | 130.6 MB | ########7 | 87%  2025-05-07T20:26:05.8788547Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:05.8788807Z 2025-05-07T20:26:05.8788812Z 2025-05-07T20:26:05.8788815Z 2025-05-07T20:26:05.8788819Z 2025-05-07T20:26:05.8788823Z 2025-05-07T20:26:05.8788827Z 2025-05-07T20:26:05.8791532Z 2025-05-07T20:26:05.8954254Z cuda-nvvp-12.8.57 | 112.4 MB | ########5 | 85%  2025-05-07T20:26:05.8954666Z 2025-05-07T20:26:05.8954672Z 2025-05-07T20:26:05.8954677Z 2025-05-07T20:26:05.8954682Z 2025-05-07T20:26:05.8954687Z 2025-05-07T20:26:05.8954693Z 2025-05-07T20:26:05.9193985Z cuda-nsight-12.8.55 | 113.2 MB | #########5 | 95%  2025-05-07T20:26:05.9194391Z 2025-05-07T20:26:05.9194397Z 2025-05-07T20:26:05.9194402Z 2025-05-07T20:26:05.9194407Z 2025-05-07T20:26:05.9202506Z 2025-05-07T20:26:05.9816292Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:26:05.9947071Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:26:05.9947494Z 2025-05-07T20:26:05.9947500Z 2025-05-07T20:26:05.9947506Z 2025-05-07T20:26:05.9947511Z 2025-05-07T20:26:05.9947516Z 2025-05-07T20:26:05.9947521Z 2025-05-07T20:26:05.9953582Z 2025-05-07T20:26:05.9968978Z cuda-nvvp-12.8.57 | 112.4 MB | ########7 | 87%  2025-05-07T20:26:05.9969282Z 2025-05-07T20:26:05.9969286Z 2025-05-07T20:26:05.9969507Z 2025-05-07T20:26:05.9969512Z 2025-05-07T20:26:05.9969516Z 2025-05-07T20:26:05.9969520Z 2025-05-07T20:26:06.0200820Z cuda-nsight-12.8.55 | 113.2 MB | #########7 | 98%  2025-05-07T20:26:06.0201229Z 2025-05-07T20:26:06.0201233Z 2025-05-07T20:26:06.0201237Z 2025-05-07T20:26:06.0201240Z 2025-05-07T20:26:06.0204885Z 2025-05-07T20:26:06.0827636Z libnpp-12.3.3.65 | 130.6 MB | #########1 | 91%  2025-05-07T20:26:06.1003952Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 86% 2025-05-07T20:26:06.1004340Z 2025-05-07T20:26:06.1004346Z 2025-05-07T20:26:06.1004351Z 2025-05-07T20:26:06.1004629Z 2025-05-07T20:26:06.1004634Z 2025-05-07T20:26:06.1004640Z 2025-05-07T20:26:06.1006556Z 2025-05-07T20:26:06.1206878Z cuda-nvvp-12.8.57 | 112.4 MB | ########9 | 90%  2025-05-07T20:26:06.1207280Z 2025-05-07T20:26:06.1207286Z 2025-05-07T20:26:06.1207291Z 2025-05-07T20:26:06.1207297Z 2025-05-07T20:26:06.1208662Z 2025-05-07T20:26:06.1831251Z libnpp-12.3.3.65 | 130.6 MB | #########3 | 93%  2025-05-07T20:26:06.2038062Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 
2025-05-07T20:26:06.2038334Z 2025-05-07T20:26:06.2038338Z 2025-05-07T20:26:06.2038342Z 2025-05-07T20:26:06.2038345Z 2025-05-07T20:26:06.2038349Z 2025-05-07T20:26:06.2038352Z 2025-05-07T20:26:06.2038357Z 2025-05-07T20:26:06.2210673Z cuda-nvvp-12.8.57 | 112.4 MB | #########1 | 92%  2025-05-07T20:26:06.2211100Z 2025-05-07T20:26:06.2211107Z 2025-05-07T20:26:06.2211113Z 2025-05-07T20:26:06.2211118Z 2025-05-07T20:26:06.2214443Z 2025-05-07T20:26:06.2841692Z libnpp-12.3.3.65 | 130.6 MB | #########5 | 96%  2025-05-07T20:26:06.3041925Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 87% 2025-05-07T20:26:06.3042189Z 2025-05-07T20:26:06.3042193Z 2025-05-07T20:26:06.3042197Z 2025-05-07T20:26:06.3042200Z 2025-05-07T20:26:06.3042204Z 2025-05-07T20:26:06.3042208Z 2025-05-07T20:26:06.3044948Z 2025-05-07T20:26:06.3212768Z cuda-nvvp-12.8.57 | 112.4 MB | #########4 | 94%  2025-05-07T20:26:06.3213173Z 2025-05-07T20:26:06.3213178Z 2025-05-07T20:26:06.3213183Z 2025-05-07T20:26:06.3213187Z 2025-05-07T20:26:06.3213192Z 2025-05-07T20:26:06.3856581Z libnpp-12.3.3.65 | 130.6 MB | #########7 | 98%  2025-05-07T20:26:06.4047181Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:06.4047542Z 2025-05-07T20:26:06.4047548Z 2025-05-07T20:26:06.4047553Z 2025-05-07T20:26:06.4047558Z 2025-05-07T20:26:06.4047564Z 2025-05-07T20:26:06.4047569Z 2025-05-07T20:26:06.4047574Z 2025-05-07T20:26:06.4865894Z cuda-nvvp-12.8.57 | 112.4 MB | #########6 | 97%  2025-05-07T20:26:06.5048766Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:06.5049041Z 2025-05-07T20:26:06.5049047Z 2025-05-07T20:26:06.5049052Z 2025-05-07T20:26:06.5049058Z 2025-05-07T20:26:06.5049063Z 2025-05-07T20:26:06.5049068Z 2025-05-07T20:26:06.5055414Z 2025-05-07T20:26:06.5866912Z cuda-nvvp-12.8.57 | 112.4 MB | #########9 | 99%  2025-05-07T20:26:06.6866414Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:06.7867110Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:06.8868362Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:06.9942821Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:26:07.0943060Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:07.1950286Z libcublas-12.8.3.14 | 460.2 MB | #########2 | 93% 2025-05-07T20:26:07.2969412Z libcublas-12.8.3.14 | 460.2 MB | #########3 | 93% 2025-05-07T20:26:07.3977261Z libcublas-12.8.3.14 | 460.2 MB | #########4 | 94% 2025-05-07T20:26:07.5000127Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 95% 2025-05-07T20:26:07.6001318Z libcublas-12.8.3.14 | 460.2 MB | #########5 | 96% 2025-05-07T20:26:07.7049146Z libcublas-12.8.3.14 | 460.2 MB | #########6 | 97% 2025-05-07T20:26:07.8092648Z libcublas-12.8.3.14 | 460.2 MB | #########7 | 97% 2025-05-07T20:26:07.9096326Z libcublas-12.8.3.14 | 460.2 MB | #########8 | 98% 2025-05-07T20:26:08.0097614Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 99% 2025-05-07T20:26:09.7613162Z libcublas-12.8.3.14 | 460.2 MB | #########9 | 100% 2025-05-07T20:26:09.7613435Z 2025-05-07T20:26:09.7613646Z 2025-05-07T20:26:09.7613659Z 2025-05-07T20:26:09.7613767Z 2025-05-07T20:26:09.7613779Z 2025-05-07T20:26:09.7616368Z 2025-05-07T20:26:09.8263133Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:09.8263892Z 2025-05-07T20:26:09.8263900Z 2025-05-07T20:26:09.8263906Z 2025-05-07T20:26:09.8263912Z 2025-05-07T20:26:09.8263919Z 2025-05-07T20:26:09.8263925Z 2025-05-07T20:26:09.8263932Z 2025-05-07T20:26:09.8263938Z 2025-05-07T20:26:09.9277502Z cuda-nvrtc-12.8.61 | 63.1 MB | | 0%  
2025-05-07T20:26:09.9277890Z 2025-05-07T20:26:09.9277894Z 2025-05-07T20:26:09.9277926Z 2025-05-07T20:26:09.9277930Z 2025-05-07T20:26:09.9277934Z 2025-05-07T20:26:09.9277938Z 2025-05-07T20:26:09.9277942Z 2025-05-07T20:26:09.9282025Z 2025-05-07T20:26:10.0290591Z cuda-nvrtc-12.8.61 | 63.1 MB | 5 | 6%  2025-05-07T20:26:10.0291038Z 2025-05-07T20:26:10.0291044Z 2025-05-07T20:26:10.0291049Z 2025-05-07T20:26:10.0291055Z 2025-05-07T20:26:10.0291061Z 2025-05-07T20:26:10.0291068Z 2025-05-07T20:26:10.0291073Z 2025-05-07T20:26:10.0292603Z 2025-05-07T20:26:10.1244447Z cuda-nvrtc-12.8.61 | 63.1 MB | #1 | 11%  2025-05-07T20:26:10.1244885Z 2025-05-07T20:26:10.1244889Z 2025-05-07T20:26:10.1244893Z 2025-05-07T20:26:10.1249755Z 2025-05-07T20:26:10.1290362Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:10.1290673Z 2025-05-07T20:26:10.1290677Z 2025-05-07T20:26:10.1290682Z 2025-05-07T20:26:10.1290687Z 2025-05-07T20:26:10.1290690Z 2025-05-07T20:26:10.1290694Z 2025-05-07T20:26:10.1290712Z 2025-05-07T20:26:10.1292545Z 2025-05-07T20:26:10.2296173Z cuda-nvrtc-12.8.61 | 63.1 MB | #7 | 17%  2025-05-07T20:26:10.2296481Z 2025-05-07T20:26:10.2299193Z 2025-05-07T20:26:10.2299271Z 2025-05-07T20:26:10.2299277Z 2025-05-07T20:26:10.2299283Z 2025-05-07T20:26:10.2299288Z 2025-05-07T20:26:10.2299294Z 2025-05-07T20:26:10.2299341Z 2025-05-07T20:26:10.3409176Z cuda-nvrtc-12.8.61 | 63.1 MB | ##2 | 23%  2025-05-07T20:26:10.3409583Z 2025-05-07T20:26:10.3409587Z 2025-05-07T20:26:10.3409591Z 2025-05-07T20:26:10.3409594Z 2025-05-07T20:26:10.3409623Z 2025-05-07T20:26:10.3409626Z 2025-05-07T20:26:10.3409630Z 2025-05-07T20:26:10.3410777Z 2025-05-07T20:26:10.4410547Z cuda-nvrtc-12.8.61 | 63.1 MB | ##8 | 28%  2025-05-07T20:26:10.4410861Z 2025-05-07T20:26:10.4410865Z 2025-05-07T20:26:10.4410869Z 2025-05-07T20:26:10.4410873Z 2025-05-07T20:26:10.4410876Z 2025-05-07T20:26:10.4410901Z 2025-05-07T20:26:10.4410905Z 2025-05-07T20:26:10.4412228Z 2025-05-07T20:26:10.4669696Z cuda-nvrtc-12.8.61 | 63.1 MB | ###3 | 34%  2025-05-07T20:26:10.4669993Z 2025-05-07T20:26:10.4669998Z 2025-05-07T20:26:10.4670001Z 2025-05-07T20:26:10.4670005Z 2025-05-07T20:26:10.4670009Z 2025-05-07T20:26:10.4670012Z 2025-05-07T20:26:10.4683265Z 2025-05-07T20:26:10.5324316Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:10.5324614Z 2025-05-07T20:26:10.5324618Z 2025-05-07T20:26:10.5324622Z 2025-05-07T20:26:10.5324626Z 2025-05-07T20:26:10.5324652Z 2025-05-07T20:26:10.5324656Z 2025-05-07T20:26:10.5324660Z 2025-05-07T20:26:10.5324672Z 2025-05-07T20:26:10.5326986Z 2025-05-07T20:26:10.5412036Z libcurand-10.3.9.55 | 43.6 MB | | 0%  2025-05-07T20:26:10.5412334Z 2025-05-07T20:26:10.5412346Z 2025-05-07T20:26:10.5412350Z 2025-05-07T20:26:10.5412354Z 2025-05-07T20:26:10.5412357Z 2025-05-07T20:26:10.5412606Z 2025-05-07T20:26:10.5412611Z 2025-05-07T20:26:10.5412617Z 2025-05-07T20:26:10.6333146Z cuda-nvrtc-12.8.61 | 63.1 MB | ###9 | 40%  2025-05-07T20:26:10.6333474Z 2025-05-07T20:26:10.6333478Z 2025-05-07T20:26:10.6333481Z 2025-05-07T20:26:10.6333485Z 2025-05-07T20:26:10.6333489Z 2025-05-07T20:26:10.6333492Z 2025-05-07T20:26:10.6333496Z 2025-05-07T20:26:10.6333500Z 2025-05-07T20:26:10.6333503Z 2025-05-07T20:26:10.6519213Z libcurand-10.3.9.55 | 43.6 MB | 6 | 6%  2025-05-07T20:26:10.6519667Z 2025-05-07T20:26:10.6519955Z 2025-05-07T20:26:10.6519958Z 2025-05-07T20:26:10.6519962Z 2025-05-07T20:26:10.6519966Z 2025-05-07T20:26:10.6519969Z 2025-05-07T20:26:10.6519973Z 2025-05-07T20:26:10.6521138Z 2025-05-07T20:26:10.7333033Z cuda-nvrtc-12.8.61 | 63.1 MB | ####5 | 
46%  2025-05-07T20:26:10.7333344Z 2025-05-07T20:26:10.7333348Z 2025-05-07T20:26:10.7333352Z 2025-05-07T20:26:10.7333377Z 2025-05-07T20:26:10.7333381Z 2025-05-07T20:26:10.7333384Z 2025-05-07T20:26:10.7333388Z 2025-05-07T20:26:10.7333392Z 2025-05-07T20:26:10.7333395Z 2025-05-07T20:26:10.7544968Z libcurand-10.3.9.55 | 43.6 MB | #2 | 12%  2025-05-07T20:26:10.7545271Z 2025-05-07T20:26:10.7545275Z 2025-05-07T20:26:10.7545279Z 2025-05-07T20:26:10.7545283Z 2025-05-07T20:26:10.7545287Z 2025-05-07T20:26:10.7545290Z 2025-05-07T20:26:10.7545294Z 2025-05-07T20:26:10.7545306Z 2025-05-07T20:26:10.8339437Z cuda-nvrtc-12.8.61 | 63.1 MB | #####1 | 51%  2025-05-07T20:26:10.8339870Z 2025-05-07T20:26:10.8339874Z 2025-05-07T20:26:10.8339878Z 2025-05-07T20:26:10.8339882Z 2025-05-07T20:26:10.8339886Z 2025-05-07T20:26:10.8339889Z 2025-05-07T20:26:10.8339893Z 2025-05-07T20:26:10.8339897Z 2025-05-07T20:26:10.8343687Z 2025-05-07T20:26:10.8609132Z libcurand-10.3.9.55 | 43.6 MB | #9 | 19%  2025-05-07T20:26:10.8609473Z 2025-05-07T20:26:10.8609477Z 2025-05-07T20:26:10.8609481Z 2025-05-07T20:26:10.8609492Z 2025-05-07T20:26:10.8609496Z 2025-05-07T20:26:10.8609500Z 2025-05-07T20:26:10.8609503Z 2025-05-07T20:26:10.8610246Z 2025-05-07T20:26:10.9346642Z cuda-nvrtc-12.8.61 | 63.1 MB | #####6 | 57%  2025-05-07T20:26:10.9347031Z 2025-05-07T20:26:10.9347037Z 2025-05-07T20:26:10.9347042Z 2025-05-07T20:26:10.9347047Z 2025-05-07T20:26:10.9347053Z 2025-05-07T20:26:10.9347058Z 2025-05-07T20:26:10.9347063Z 2025-05-07T20:26:10.9347072Z 2025-05-07T20:26:10.9347077Z 2025-05-07T20:26:10.9716832Z libcurand-10.3.9.55 | 43.6 MB | ##6 | 26%  2025-05-07T20:26:10.9717185Z 2025-05-07T20:26:10.9717189Z 2025-05-07T20:26:10.9717193Z 2025-05-07T20:26:10.9717196Z 2025-05-07T20:26:10.9717200Z 2025-05-07T20:26:10.9717204Z 2025-05-07T20:26:10.9717210Z 2025-05-07T20:26:10.9722227Z 2025-05-07T20:26:11.0363244Z cuda-nvrtc-12.8.61 | 63.1 MB | ######1 | 62%  2025-05-07T20:26:11.0363717Z 2025-05-07T20:26:11.0363723Z 2025-05-07T20:26:11.0363729Z 2025-05-07T20:26:11.0363734Z 2025-05-07T20:26:11.0365214Z 2025-05-07T20:26:11.0365620Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:11.0365889Z 2025-05-07T20:26:11.0365893Z 2025-05-07T20:26:11.0365897Z 2025-05-07T20:26:11.0365900Z 2025-05-07T20:26:11.0366028Z 2025-05-07T20:26:11.0410689Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:11.0410972Z 2025-05-07T20:26:11.0410976Z 2025-05-07T20:26:11.0410980Z 2025-05-07T20:26:11.0410996Z 2025-05-07T20:26:11.0411000Z 2025-05-07T20:26:11.0411003Z 2025-05-07T20:26:11.0411007Z 2025-05-07T20:26:11.0411011Z 2025-05-07T20:26:11.0416940Z 2025-05-07T20:26:11.0801991Z libcurand-10.3.9.55 | 43.6 MB | ###2 | 33%  2025-05-07T20:26:11.0802338Z 2025-05-07T20:26:11.0802342Z 2025-05-07T20:26:11.0802345Z 2025-05-07T20:26:11.0802589Z 2025-05-07T20:26:11.0802594Z 2025-05-07T20:26:11.0802598Z 2025-05-07T20:26:11.0802601Z 2025-05-07T20:26:11.0803215Z 2025-05-07T20:26:11.0891281Z cuda-nvrtc-12.8.61 | 63.1 MB | ######7 | 67%  2025-05-07T20:26:11.0891567Z 2025-05-07T20:26:11.0891571Z 2025-05-07T20:26:11.0891575Z 2025-05-07T20:26:11.0891579Z 2025-05-07T20:26:11.0891592Z 2025-05-07T20:26:11.0891596Z 2025-05-07T20:26:11.0891599Z 2025-05-07T20:26:11.0891603Z 2025-05-07T20:26:11.0891606Z 2025-05-07T20:26:11.0891610Z 2025-05-07T20:26:11.1419923Z gds-tools-1.13.0.11 | 37.9 MB | | 0%  2025-05-07T20:26:11.1420488Z 2025-05-07T20:26:11.1420492Z 2025-05-07T20:26:11.1420496Z 2025-05-07T20:26:11.1420499Z 2025-05-07T20:26:11.1420503Z 2025-05-07T20:26:11.1420506Z 
2025-05-07T20:26:11.1420510Z 2025-05-07T20:26:11.1420514Z 2025-05-07T20:26:11.1422860Z 2025-05-07T20:26:11.1892065Z libcurand-10.3.9.55 | 43.6 MB | ###9 | 39%  2025-05-07T20:26:11.1892395Z 2025-05-07T20:26:11.1892400Z 2025-05-07T20:26:11.1892403Z 2025-05-07T20:26:11.1892407Z 2025-05-07T20:26:11.1892411Z 2025-05-07T20:26:11.1892414Z 2025-05-07T20:26:11.1892418Z 2025-05-07T20:26:11.1893693Z 2025-05-07T20:26:11.1902162Z cuda-nvrtc-12.8.61 | 63.1 MB | #######2 | 72%  2025-05-07T20:26:11.1902441Z 2025-05-07T20:26:11.1902445Z 2025-05-07T20:26:11.1902449Z 2025-05-07T20:26:11.1902452Z 2025-05-07T20:26:11.1902456Z 2025-05-07T20:26:11.1902459Z 2025-05-07T20:26:11.1902463Z 2025-05-07T20:26:11.1902470Z 2025-05-07T20:26:11.1902580Z 2025-05-07T20:26:11.1904490Z 2025-05-07T20:26:11.2570318Z gds-tools-1.13.0.11 | 37.9 MB | 6 | 7%  2025-05-07T20:26:11.2570653Z 2025-05-07T20:26:11.2570659Z 2025-05-07T20:26:11.2570674Z 2025-05-07T20:26:11.2570681Z 2025-05-07T20:26:11.2570686Z 2025-05-07T20:26:11.2570691Z 2025-05-07T20:26:11.2570697Z 2025-05-07T20:26:11.2570704Z 2025-05-07T20:26:11.2574817Z 2025-05-07T20:26:11.2904027Z libcurand-10.3.9.55 | 43.6 MB | ####5 | 46%  2025-05-07T20:26:11.2904326Z 2025-05-07T20:26:11.2904330Z 2025-05-07T20:26:11.2904334Z 2025-05-07T20:26:11.2904338Z 2025-05-07T20:26:11.2904342Z 2025-05-07T20:26:11.2904345Z 2025-05-07T20:26:11.2904350Z 2025-05-07T20:26:11.2904353Z 2025-05-07T20:26:11.2904357Z 2025-05-07T20:26:11.2904402Z 2025-05-07T20:26:11.3080387Z gds-tools-1.13.0.11 | 37.9 MB | #4 | 14%  2025-05-07T20:26:11.3080766Z 2025-05-07T20:26:11.3080772Z 2025-05-07T20:26:11.3080777Z 2025-05-07T20:26:11.3080810Z 2025-05-07T20:26:11.3080816Z 2025-05-07T20:26:11.3080821Z 2025-05-07T20:26:11.3080826Z 2025-05-07T20:26:11.3080832Z 2025-05-07T20:26:11.3664517Z cuda-nvrtc-12.8.61 | 63.1 MB | #######7 | 77%  2025-05-07T20:26:11.3664831Z 2025-05-07T20:26:11.3664835Z 2025-05-07T20:26:11.3664840Z 2025-05-07T20:26:11.3664843Z 2025-05-07T20:26:11.3664847Z 2025-05-07T20:26:11.3664875Z 2025-05-07T20:26:11.3664879Z 2025-05-07T20:26:11.3664891Z 2025-05-07T20:26:11.3664894Z 2025-05-07T20:26:11.3906150Z libcurand-10.3.9.55 | 43.6 MB | #####2 | 52%  2025-05-07T20:26:11.3906489Z 2025-05-07T20:26:11.3906493Z 2025-05-07T20:26:11.3906505Z 2025-05-07T20:26:11.3906509Z 2025-05-07T20:26:11.3906512Z 2025-05-07T20:26:11.3906516Z 2025-05-07T20:26:11.3906519Z 2025-05-07T20:26:11.3906524Z 2025-05-07T20:26:11.3906527Z 2025-05-07T20:26:11.3907004Z 2025-05-07T20:26:11.4174819Z gds-tools-1.13.0.11 | 37.9 MB | ##1 | 21%  2025-05-07T20:26:11.4175266Z 2025-05-07T20:26:11.4175273Z 2025-05-07T20:26:11.4175279Z 2025-05-07T20:26:11.4175286Z 2025-05-07T20:26:11.4175292Z 2025-05-07T20:26:11.4175299Z 2025-05-07T20:26:11.4175305Z 2025-05-07T20:26:11.4180063Z 2025-05-07T20:26:11.4680285Z cuda-nvrtc-12.8.61 | 63.1 MB | ########1 | 82%  2025-05-07T20:26:11.4680613Z 2025-05-07T20:26:11.4680879Z 2025-05-07T20:26:11.4680888Z 2025-05-07T20:26:11.4680894Z 2025-05-07T20:26:11.4680899Z 2025-05-07T20:26:11.4680902Z 2025-05-07T20:26:11.4680906Z 2025-05-07T20:26:11.4680910Z 2025-05-07T20:26:11.4680913Z 2025-05-07T20:26:11.4953783Z libcurand-10.3.9.55 | 43.6 MB | #####8 | 58%  2025-05-07T20:26:11.4954088Z 2025-05-07T20:26:11.4954092Z 2025-05-07T20:26:11.4954096Z 2025-05-07T20:26:11.4954099Z 2025-05-07T20:26:11.4954103Z 2025-05-07T20:26:11.4954107Z 2025-05-07T20:26:11.4954110Z 2025-05-07T20:26:11.4954114Z 2025-05-07T20:26:11.4954117Z 2025-05-07T20:26:11.4954406Z 2025-05-07T20:26:11.5226633Z gds-tools-1.13.0.11 | 37.9 MB | ##8 | 29%  
2025-05-07T20:26:11.5227059Z 2025-05-07T20:26:11.5227065Z 2025-05-07T20:26:11.5227070Z 2025-05-07T20:26:11.5227074Z 2025-05-07T20:26:11.5227079Z 2025-05-07T20:26:11.5227094Z 2025-05-07T20:26:11.5227100Z 2025-05-07T20:26:11.5228849Z 2025-05-07T20:26:11.5683220Z cuda-nvrtc-12.8.61 | 63.1 MB | ########6 | 86%  2025-05-07T20:26:11.5683537Z 2025-05-07T20:26:11.5683541Z 2025-05-07T20:26:11.5683545Z 2025-05-07T20:26:11.5683548Z 2025-05-07T20:26:11.5683552Z 2025-05-07T20:26:11.5683555Z 2025-05-07T20:26:11.5683559Z 2025-05-07T20:26:11.5683563Z 2025-05-07T20:26:11.5684921Z 2025-05-07T20:26:11.5954158Z libcurand-10.3.9.55 | 43.6 MB | ######4 | 65%  2025-05-07T20:26:11.5954487Z 2025-05-07T20:26:11.5954492Z 2025-05-07T20:26:11.5954497Z 2025-05-07T20:26:11.5954502Z 2025-05-07T20:26:11.5954507Z 2025-05-07T20:26:11.5954543Z 2025-05-07T20:26:11.5954549Z 2025-05-07T20:26:11.5954554Z 2025-05-07T20:26:11.5954559Z 2025-05-07T20:26:11.5954565Z 2025-05-07T20:26:11.6227541Z gds-tools-1.13.0.11 | 37.9 MB | ###6 | 37%  2025-05-07T20:26:11.6227846Z 2025-05-07T20:26:11.6227851Z 2025-05-07T20:26:11.6227857Z 2025-05-07T20:26:11.6227862Z 2025-05-07T20:26:11.6227866Z 2025-05-07T20:26:11.6227897Z 2025-05-07T20:26:11.6227901Z 2025-05-07T20:26:11.6231517Z 2025-05-07T20:26:11.6696754Z cuda-nvrtc-12.8.61 | 63.1 MB | #########1 | 91%  2025-05-07T20:26:11.6697054Z 2025-05-07T20:26:11.6697066Z 2025-05-07T20:26:11.6697071Z 2025-05-07T20:26:11.6697074Z 2025-05-07T20:26:11.6697078Z 2025-05-07T20:26:11.6697082Z 2025-05-07T20:26:11.6697086Z 2025-05-07T20:26:11.6697089Z 2025-05-07T20:26:11.6697273Z 2025-05-07T20:26:11.7178901Z libcurand-10.3.9.55 | 43.6 MB | ####### | 71%  2025-05-07T20:26:11.7179349Z 2025-05-07T20:26:11.7179386Z 2025-05-07T20:26:11.7179393Z 2025-05-07T20:26:11.7179401Z 2025-05-07T20:26:11.7179408Z 2025-05-07T20:26:11.7179415Z 2025-05-07T20:26:11.7179422Z 2025-05-07T20:26:11.7179428Z 2025-05-07T20:26:11.7179434Z 2025-05-07T20:26:11.7181402Z 2025-05-07T20:26:11.7231608Z gds-tools-1.13.0.11 | 37.9 MB | ####4 | 45%  2025-05-07T20:26:11.7231906Z 2025-05-07T20:26:11.7231925Z 2025-05-07T20:26:11.7231930Z 2025-05-07T20:26:11.7231933Z 2025-05-07T20:26:11.7231937Z 2025-05-07T20:26:11.7231941Z 2025-05-07T20:26:11.7231945Z 2025-05-07T20:26:11.7231948Z 2025-05-07T20:26:11.7711783Z cuda-nvrtc-12.8.61 | 63.1 MB | #########5 | 96%  2025-05-07T20:26:11.7712083Z 2025-05-07T20:26:11.7712087Z 2025-05-07T20:26:11.7712091Z 2025-05-07T20:26:11.7712094Z 2025-05-07T20:26:11.7712108Z 2025-05-07T20:26:11.7712113Z 2025-05-07T20:26:11.7712117Z 2025-05-07T20:26:11.7712120Z 2025-05-07T20:26:11.7712124Z 2025-05-07T20:26:11.8181583Z libcurand-10.3.9.55 | 43.6 MB | #######7 | 77%  2025-05-07T20:26:11.8181930Z 2025-05-07T20:26:11.8181934Z 2025-05-07T20:26:11.8181938Z 2025-05-07T20:26:11.8181942Z 2025-05-07T20:26:11.8181946Z 2025-05-07T20:26:11.8181949Z 2025-05-07T20:26:11.8181953Z 2025-05-07T20:26:11.8181957Z 2025-05-07T20:26:11.8181960Z 2025-05-07T20:26:11.8183480Z 2025-05-07T20:26:11.8713187Z gds-tools-1.13.0.11 | 37.9 MB | #####1 | 52%  2025-05-07T20:26:11.8713613Z 2025-05-07T20:26:11.8713619Z 2025-05-07T20:26:11.8713625Z 2025-05-07T20:26:11.8713630Z 2025-05-07T20:26:11.8713635Z 2025-05-07T20:26:11.8713640Z 2025-05-07T20:26:11.8713645Z 2025-05-07T20:26:11.8713651Z 2025-05-07T20:26:11.8717610Z 2025-05-07T20:26:11.9182392Z libcurand-10.3.9.55 | 43.6 MB | ########4 | 84%  2025-05-07T20:26:11.9182703Z 2025-05-07T20:26:11.9182707Z 2025-05-07T20:26:11.9182711Z 2025-05-07T20:26:11.9182715Z 2025-05-07T20:26:11.9182718Z 2025-05-07T20:26:11.9182981Z 
2025-05-07T20:26:11.9182985Z 2025-05-07T20:26:11.9182988Z 2025-05-07T20:26:11.9182992Z 2025-05-07T20:26:11.9182996Z 2025-05-07T20:26:11.9719201Z gds-tools-1.13.0.11 | 37.9 MB | #####9 | 60%  2025-05-07T20:26:11.9719524Z 2025-05-07T20:26:11.9719528Z 2025-05-07T20:26:11.9719532Z 2025-05-07T20:26:11.9719537Z 2025-05-07T20:26:11.9719561Z 2025-05-07T20:26:11.9719573Z 2025-05-07T20:26:11.9719576Z 2025-05-07T20:26:11.9719580Z 2025-05-07T20:26:11.9719584Z 2025-05-07T20:26:12.0189327Z libcurand-10.3.9.55 | 43.6 MB | #########3 | 94%  2025-05-07T20:26:12.0189646Z 2025-05-07T20:26:12.0189650Z 2025-05-07T20:26:12.0189654Z 2025-05-07T20:26:12.0189657Z 2025-05-07T20:26:12.0189661Z 2025-05-07T20:26:12.0189666Z 2025-05-07T20:26:12.0189670Z 2025-05-07T20:26:12.0189674Z 2025-05-07T20:26:12.0189677Z 2025-05-07T20:26:12.0195815Z 2025-05-07T20:26:12.1189703Z gds-tools-1.13.0.11 | 37.9 MB | ######7 | 67%  2025-05-07T20:26:12.1190066Z 2025-05-07T20:26:12.1190070Z 2025-05-07T20:26:12.1190074Z 2025-05-07T20:26:12.1190078Z 2025-05-07T20:26:12.1190081Z 2025-05-07T20:26:12.1190085Z 2025-05-07T20:26:12.1190089Z 2025-05-07T20:26:12.1190092Z 2025-05-07T20:26:12.1190096Z 2025-05-07T20:26:12.1190100Z 2025-05-07T20:26:12.2190120Z gds-tools-1.13.0.11 | 37.9 MB | #######5 | 76%  2025-05-07T20:26:12.2190439Z 2025-05-07T20:26:12.2190444Z 2025-05-07T20:26:12.2190449Z 2025-05-07T20:26:12.2190453Z 2025-05-07T20:26:12.2190456Z 2025-05-07T20:26:12.2190461Z 2025-05-07T20:26:12.2190464Z 2025-05-07T20:26:12.2190468Z 2025-05-07T20:26:12.2190471Z 2025-05-07T20:26:12.2192213Z 2025-05-07T20:26:12.3196119Z gds-tools-1.13.0.11 | 37.9 MB | ########4 | 85%  2025-05-07T20:26:12.3196437Z 2025-05-07T20:26:12.3196441Z 2025-05-07T20:26:12.3196445Z 2025-05-07T20:26:12.3196449Z 2025-05-07T20:26:12.3196452Z 2025-05-07T20:26:12.3196456Z 2025-05-07T20:26:12.3196487Z 2025-05-07T20:26:12.3196490Z 2025-05-07T20:26:12.3196494Z 2025-05-07T20:26:12.3197161Z 2025-05-07T20:26:13.3794632Z gds-tools-1.13.0.11 | 37.9 MB | #########3 | 94%  2025-05-07T20:26:13.3794965Z 2025-05-07T20:26:13.3794969Z 2025-05-07T20:26:13.3795519Z 2025-05-07T20:26:13.4823710Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:13.4824139Z 2025-05-07T20:26:13.4824144Z 2025-05-07T20:26:13.4824150Z 2025-05-07T20:26:13.4824155Z 2025-05-07T20:26:13.4824160Z 2025-05-07T20:26:13.4824165Z 2025-05-07T20:26:13.4824169Z 2025-05-07T20:26:13.4824176Z 2025-05-07T20:26:13.4826525Z 2025-05-07T20:26:13.5333866Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:13.5334293Z 2025-05-07T20:26:13.5334299Z 2025-05-07T20:26:13.5334304Z 2025-05-07T20:26:13.5334309Z 2025-05-07T20:26:13.5334315Z 2025-05-07T20:26:13.5334320Z 2025-05-07T20:26:13.5334325Z 2025-05-07T20:26:13.5334353Z 2025-05-07T20:26:13.5334359Z 2025-05-07T20:26:13.5334364Z 2025-05-07T20:26:13.5334369Z 2025-05-07T20:26:13.5997079Z python-3.11.8 | 29.3 MB | | 0%  2025-05-07T20:26:13.5997482Z 2025-05-07T20:26:13.5997488Z 2025-05-07T20:26:13.5997493Z 2025-05-07T20:26:13.5997498Z 2025-05-07T20:26:13.5997503Z 2025-05-07T20:26:13.6002393Z 2025-05-07T20:26:13.6337601Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%  2025-05-07T20:26:13.6337976Z 2025-05-07T20:26:13.6337980Z 2025-05-07T20:26:13.6337984Z 2025-05-07T20:26:13.6337988Z 2025-05-07T20:26:13.6338001Z 2025-05-07T20:26:13.6338005Z 2025-05-07T20:26:13.6338009Z 2025-05-07T20:26:13.6338013Z 2025-05-07T20:26:13.6338016Z 2025-05-07T20:26:13.6338020Z 2025-05-07T20:26:13.6340222Z 2025-05-07T20:26:13.7130239Z python-3.11.8 | 29.3 MB | #1 | 12%  
2025-05-07T20:26:13.7130627Z 2025-05-07T20:26:13.7130631Z 2025-05-07T20:26:13.7130855Z 2025-05-07T20:26:13.7130858Z 2025-05-07T20:26:13.7130862Z 2025-05-07T20:26:13.7130866Z 2025-05-07T20:26:13.7130869Z 2025-05-07T20:26:13.7130873Z 2025-05-07T20:26:13.7130877Z 2025-05-07T20:26:13.7130880Z 2025-05-07T20:26:13.7344726Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:13.7345034Z 2025-05-07T20:26:13.7345052Z 2025-05-07T20:26:13.7345069Z 2025-05-07T20:26:13.7345075Z 2025-05-07T20:26:13.7345080Z 2025-05-07T20:26:13.7345085Z 2025-05-07T20:26:13.7345090Z 2025-05-07T20:26:13.7345096Z 2025-05-07T20:26:13.7345100Z 2025-05-07T20:26:13.7345105Z 2025-05-07T20:26:13.7345118Z 2025-05-07T20:26:13.7877907Z python-3.11.8 | 29.3 MB | ##3 | 24%  2025-05-07T20:26:13.7878239Z 2025-05-07T20:26:13.7878243Z 2025-05-07T20:26:13.7878257Z 2025-05-07T20:26:13.7878261Z 2025-05-07T20:26:13.7878264Z 2025-05-07T20:26:13.7878268Z 2025-05-07T20:26:13.7878271Z 2025-05-07T20:26:13.7878275Z 2025-05-07T20:26:13.7878289Z 2025-05-07T20:26:13.7878292Z 2025-05-07T20:26:13.7878296Z 2025-05-07T20:26:13.7884993Z 2025-05-07T20:26:13.8335541Z libnvjitlink-12.8.61 | 28.7 MB | | 0%  2025-05-07T20:26:13.8336132Z 2025-05-07T20:26:13.8336137Z 2025-05-07T20:26:13.8336141Z 2025-05-07T20:26:13.8336146Z 2025-05-07T20:26:13.8336150Z 2025-05-07T20:26:13.8336168Z 2025-05-07T20:26:13.8336172Z 2025-05-07T20:26:13.8336177Z 2025-05-07T20:26:13.8349459Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%  2025-05-07T20:26:13.8349845Z 2025-05-07T20:26:13.8349849Z 2025-05-07T20:26:13.8349853Z 2025-05-07T20:26:13.8349857Z 2025-05-07T20:26:13.8349861Z 2025-05-07T20:26:13.8349864Z 2025-05-07T20:26:13.8349868Z 2025-05-07T20:26:13.8349872Z 2025-05-07T20:26:13.8349883Z 2025-05-07T20:26:13.8349886Z 2025-05-07T20:26:13.8349890Z 2025-05-07T20:26:13.8884191Z python-3.11.8 | 29.3 MB | ###5 | 36%  2025-05-07T20:26:13.8884586Z 2025-05-07T20:26:13.8884602Z 2025-05-07T20:26:13.8884607Z 2025-05-07T20:26:13.8884612Z 2025-05-07T20:26:13.8884617Z 2025-05-07T20:26:13.8884622Z 2025-05-07T20:26:13.8884628Z 2025-05-07T20:26:13.8884632Z 2025-05-07T20:26:13.8884637Z 2025-05-07T20:26:13.8884642Z 2025-05-07T20:26:13.8884647Z 2025-05-07T20:26:13.8884653Z 2025-05-07T20:26:13.9018959Z libnvjitlink-12.8.61 | 28.7 MB | # | 11%  2025-05-07T20:26:13.9019375Z 2025-05-07T20:26:13.9019381Z 2025-05-07T20:26:13.9019386Z 2025-05-07T20:26:13.9019391Z 2025-05-07T20:26:13.9019397Z 2025-05-07T20:26:13.9019402Z 2025-05-07T20:26:13.9019407Z 2025-05-07T20:26:13.9019412Z 2025-05-07T20:26:13.9019417Z 2025-05-07T20:26:13.9019423Z 2025-05-07T20:26:13.9019428Z 2025-05-07T20:26:13.9019433Z 2025-05-07T20:26:13.9024219Z 2025-05-07T20:26:13.9422132Z cuda-nvcc-tools-12.8 | 24.5 MB | | 0%  2025-05-07T20:26:13.9422475Z 2025-05-07T20:26:13.9422490Z 2025-05-07T20:26:13.9422494Z 2025-05-07T20:26:13.9422498Z 2025-05-07T20:26:13.9422501Z 2025-05-07T20:26:13.9422505Z 2025-05-07T20:26:13.9422515Z 2025-05-07T20:26:13.9422519Z 2025-05-07T20:26:13.9422522Z 2025-05-07T20:26:13.9422526Z 2025-05-07T20:26:13.9425542Z 2025-05-07T20:26:13.9884822Z python-3.11.8 | 29.3 MB | ####7 | 48%  2025-05-07T20:26:13.9885232Z 2025-05-07T20:26:13.9885238Z 2025-05-07T20:26:13.9885243Z 2025-05-07T20:26:13.9885248Z 2025-05-07T20:26:13.9885253Z 2025-05-07T20:26:13.9885259Z 2025-05-07T20:26:13.9885264Z 2025-05-07T20:26:13.9885269Z 2025-05-07T20:26:13.9885274Z 2025-05-07T20:26:13.9885279Z 2025-05-07T20:26:13.9885284Z 2025-05-07T20:26:13.9886893Z 2025-05-07T20:26:14.0024730Z libnvjitlink-12.8.61 | 28.7 MB | ##1 | 21%  
2025-05-07T20:26:14.0025156Z 2025-05-07T20:26:14.0025162Z 2025-05-07T20:26:14.0025167Z 2025-05-07T20:26:14.0025172Z 2025-05-07T20:26:14.0025439Z 2025-05-07T20:26:14.0025445Z 2025-05-07T20:26:14.0025450Z 2025-05-07T20:26:14.0025455Z 2025-05-07T20:26:14.0025460Z 2025-05-07T20:26:14.0025465Z 2025-05-07T20:26:14.0025470Z 2025-05-07T20:26:14.0025475Z 2025-05-07T20:26:14.0026820Z 2025-05-07T20:26:14.0282482Z cuda-nvcc-tools-12.8 | 24.5 MB | #1 | 11%  2025-05-07T20:26:14.0287522Z 2025-05-07T20:26:14.0669417Z nsight-compute-2025. | 320.6 MB | ########## | 100%  2025-05-07T20:26:14.0669727Z 2025-05-07T20:26:14.0669734Z 2025-05-07T20:26:14.0669739Z 2025-05-07T20:26:14.0669744Z 2025-05-07T20:26:14.0669750Z 2025-05-07T20:26:14.0669755Z 2025-05-07T20:26:14.0669761Z 2025-05-07T20:26:14.0669766Z 2025-05-07T20:26:14.0669772Z 2025-05-07T20:26:14.0669777Z 2025-05-07T20:26:14.0669781Z 2025-05-07T20:26:14.0964134Z python-3.11.8 | 29.3 MB | #####9 | 59%  2025-05-07T20:26:14.0964452Z 2025-05-07T20:26:14.0964456Z 2025-05-07T20:26:14.0964493Z 2025-05-07T20:26:14.0964499Z 2025-05-07T20:26:14.0964504Z 2025-05-07T20:26:14.0964509Z 2025-05-07T20:26:14.0964515Z 2025-05-07T20:26:14.0964529Z 2025-05-07T20:26:14.0964534Z 2025-05-07T20:26:14.0964539Z 2025-05-07T20:26:14.0964544Z 2025-05-07T20:26:14.0964550Z 2025-05-07T20:26:14.0964555Z 2025-05-07T20:26:14.0964561Z 2025-05-07T20:26:14.1034166Z cuda-nvvm-tools-12.8 | 23.5 MB | | 0%  2025-05-07T20:26:14.1034500Z 2025-05-07T20:26:14.1034504Z 2025-05-07T20:26:14.1034508Z 2025-05-07T20:26:14.1034511Z 2025-05-07T20:26:14.1034515Z 2025-05-07T20:26:14.1034518Z 2025-05-07T20:26:14.1034522Z 2025-05-07T20:26:14.1034525Z 2025-05-07T20:26:14.1034529Z 2025-05-07T20:26:14.1034532Z 2025-05-07T20:26:14.1034536Z 2025-05-07T20:26:14.1034539Z 2025-05-07T20:26:14.1034543Z 2025-05-07T20:26:14.1043760Z cuda-nvcc-tools-12.8 | 24.5 MB | ##3 | 24%  2025-05-07T20:26:14.1044225Z 2025-05-07T20:26:14.1044246Z 2025-05-07T20:26:14.1044251Z 2025-05-07T20:26:14.1044256Z 2025-05-07T20:26:14.1044261Z 2025-05-07T20:26:14.1044266Z 2025-05-07T20:26:14.1044271Z 2025-05-07T20:26:14.1044276Z 2025-05-07T20:26:14.1044280Z 2025-05-07T20:26:14.1044284Z 2025-05-07T20:26:14.1044287Z 2025-05-07T20:26:14.1051921Z 2025-05-07T20:26:14.1182260Z libnvjitlink-12.8.61 | 28.7 MB | ###1 | 32%  2025-05-07T20:26:14.1182586Z 2025-05-07T20:26:14.1182590Z 2025-05-07T20:26:14.1966723Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:14.1967015Z 2025-05-07T20:26:14.1967029Z 2025-05-07T20:26:14.1967032Z 2025-05-07T20:26:14.1967036Z 2025-05-07T20:26:14.1967039Z 2025-05-07T20:26:14.1967043Z 2025-05-07T20:26:14.1967055Z 2025-05-07T20:26:14.1967059Z 2025-05-07T20:26:14.1967062Z 2025-05-07T20:26:14.1967066Z 2025-05-07T20:26:14.1967069Z 2025-05-07T20:26:14.1967074Z 2025-05-07T20:26:14.1967078Z 2025-05-07T20:26:14.1967083Z 2025-05-07T20:26:14.2014065Z cuda-nvvm-tools-12.8 | 23.5 MB | # | 10%  2025-05-07T20:26:14.2014531Z 2025-05-07T20:26:14.2014537Z 2025-05-07T20:26:14.2014542Z 2025-05-07T20:26:14.2014547Z 2025-05-07T20:26:14.2014552Z 2025-05-07T20:26:14.2014559Z 2025-05-07T20:26:14.2014564Z 2025-05-07T20:26:14.2014569Z 2025-05-07T20:26:14.2014577Z 2025-05-07T20:26:14.2014868Z 2025-05-07T20:26:14.2016349Z 2025-05-07T20:26:14.2301436Z python-3.11.8 | 29.3 MB | ######9 | 70%  2025-05-07T20:26:14.2301742Z 2025-05-07T20:26:14.2301746Z 2025-05-07T20:26:14.2301749Z 2025-05-07T20:26:14.2301753Z 2025-05-07T20:26:14.2301757Z 2025-05-07T20:26:14.2301761Z 2025-05-07T20:26:14.2301764Z 2025-05-07T20:26:14.2301768Z 
2025-05-07T20:26:14.2301772Z 2025-05-07T20:26:14.2301775Z 2025-05-07T20:26:14.2301779Z 2025-05-07T20:26:14.2301786Z 2025-05-07T20:26:14.2305627Z 2025-05-07T20:26:14.2317079Z cuda-nvcc-tools-12.8 | 24.5 MB | ###5 | 35%  2025-05-07T20:26:14.2317852Z 2025-05-07T20:26:14.2317859Z 2025-05-07T20:26:14.2317876Z 2025-05-07T20:26:14.2317882Z 2025-05-07T20:26:14.2317888Z 2025-05-07T20:26:14.2317895Z 2025-05-07T20:26:14.2317901Z 2025-05-07T20:26:14.2317906Z 2025-05-07T20:26:14.2317911Z 2025-05-07T20:26:14.2317916Z 2025-05-07T20:26:14.2317921Z 2025-05-07T20:26:14.2317927Z 2025-05-07T20:26:14.2975593Z libnvjitlink-12.8.61 | 28.7 MB | ####1 | 42%  2025-05-07T20:26:14.2975942Z 2025-05-07T20:26:14.2975946Z 2025-05-07T20:26:14.2975950Z 2025-05-07T20:26:14.2975954Z 2025-05-07T20:26:14.2975957Z 2025-05-07T20:26:14.2975961Z 2025-05-07T20:26:14.2975965Z 2025-05-07T20:26:14.2975968Z 2025-05-07T20:26:14.2975972Z 2025-05-07T20:26:14.2975976Z 2025-05-07T20:26:14.2975979Z 2025-05-07T20:26:14.2975983Z 2025-05-07T20:26:14.2975987Z 2025-05-07T20:26:14.2975990Z 2025-05-07T20:26:14.3226649Z cuda-nvvm-tools-12.8 | 23.5 MB | ##1 | 21%  2025-05-07T20:26:14.3227022Z 2025-05-07T20:26:14.3227026Z 2025-05-07T20:26:14.3227030Z 2025-05-07T20:26:14.3227033Z 2025-05-07T20:26:14.3227037Z 2025-05-07T20:26:14.3227041Z 2025-05-07T20:26:14.3227044Z 2025-05-07T20:26:14.3227057Z 2025-05-07T20:26:14.3227061Z 2025-05-07T20:26:14.3227064Z 2025-05-07T20:26:14.3228738Z 2025-05-07T20:26:14.3308841Z python-3.11.8 | 29.3 MB | #######9 | 80%  2025-05-07T20:26:14.3309223Z 2025-05-07T20:26:14.3309228Z 2025-05-07T20:26:14.3309234Z 2025-05-07T20:26:14.3309253Z 2025-05-07T20:26:14.3309259Z 2025-05-07T20:26:14.3309264Z 2025-05-07T20:26:14.3309270Z 2025-05-07T20:26:14.3309275Z 2025-05-07T20:26:14.3309280Z 2025-05-07T20:26:14.3309285Z 2025-05-07T20:26:14.3309290Z 2025-05-07T20:26:14.3309295Z 2025-05-07T20:26:14.3309300Z 2025-05-07T20:26:14.3339748Z cuda-nvcc-tools-12.8 | 24.5 MB | ####6 | 46%  2025-05-07T20:26:14.3340152Z 2025-05-07T20:26:14.3340166Z 2025-05-07T20:26:14.3340170Z 2025-05-07T20:26:14.3340173Z 2025-05-07T20:26:14.3340177Z 2025-05-07T20:26:14.3340180Z 2025-05-07T20:26:14.3340184Z 2025-05-07T20:26:14.3340188Z 2025-05-07T20:26:14.3340191Z 2025-05-07T20:26:14.3340195Z 2025-05-07T20:26:14.3340198Z 2025-05-07T20:26:14.3340202Z 2025-05-07T20:26:14.3978769Z libnvjitlink-12.8.61 | 28.7 MB | ##### | 51%  2025-05-07T20:26:14.3979132Z 2025-05-07T20:26:14.3979138Z 2025-05-07T20:26:14.3979143Z 2025-05-07T20:26:14.3979148Z 2025-05-07T20:26:14.3979153Z 2025-05-07T20:26:14.3979158Z 2025-05-07T20:26:14.3979163Z 2025-05-07T20:26:14.3979168Z 2025-05-07T20:26:14.3979174Z 2025-05-07T20:26:14.3979179Z 2025-05-07T20:26:14.3979194Z 2025-05-07T20:26:14.3979200Z 2025-05-07T20:26:14.3979205Z 2025-05-07T20:26:14.3983357Z 2025-05-07T20:26:14.4343467Z cuda-nvvm-tools-12.8 | 23.5 MB | ###1 | 32%  2025-05-07T20:26:14.4343833Z 2025-05-07T20:26:14.4343864Z 2025-05-07T20:26:14.4343868Z 2025-05-07T20:26:14.4343872Z 2025-05-07T20:26:14.4343875Z 2025-05-07T20:26:14.4343879Z 2025-05-07T20:26:14.4343883Z 2025-05-07T20:26:14.4343886Z 2025-05-07T20:26:14.4343890Z 2025-05-07T20:26:14.4343894Z 2025-05-07T20:26:14.4344708Z 2025-05-07T20:26:14.4359407Z python-3.11.8 | 29.3 MB | ########9 | 89%  2025-05-07T20:26:14.4359761Z 2025-05-07T20:26:14.4359768Z 2025-05-07T20:26:14.4359773Z 2025-05-07T20:26:14.4359778Z 2025-05-07T20:26:14.4359783Z 2025-05-07T20:26:14.4359788Z 2025-05-07T20:26:14.4359793Z 2025-05-07T20:26:14.4359799Z 2025-05-07T20:26:14.4359804Z 2025-05-07T20:26:14.4359809Z 
2025-05-07T20:26:14.4359813Z 2025-05-07T20:26:14.4359818Z 2025-05-07T20:26:14.4378170Z libnvjitlink-12.8.61 | 28.7 MB | #####9 | 60%  2025-05-07T20:26:14.4378477Z 2025-05-07T20:26:14.4378481Z 2025-05-07T20:26:14.4378485Z 2025-05-07T20:26:14.4378488Z 2025-05-07T20:26:14.4378771Z 2025-05-07T20:26:14.4378774Z 2025-05-07T20:26:14.4378778Z 2025-05-07T20:26:14.4378782Z 2025-05-07T20:26:14.4378785Z 2025-05-07T20:26:14.4378810Z 2025-05-07T20:26:14.4378816Z 2025-05-07T20:26:14.4378820Z 2025-05-07T20:26:14.4378823Z 2025-05-07T20:26:14.4981947Z cuda-nvcc-tools-12.8 | 24.5 MB | #####6 | 57%  2025-05-07T20:26:14.4982326Z 2025-05-07T20:26:14.4982330Z 2025-05-07T20:26:14.4982334Z 2025-05-07T20:26:14.4982338Z 2025-05-07T20:26:14.4982342Z 2025-05-07T20:26:14.4982346Z 2025-05-07T20:26:14.4982350Z 2025-05-07T20:26:14.4982354Z 2025-05-07T20:26:14.4982358Z 2025-05-07T20:26:14.4982363Z 2025-05-07T20:26:14.4982367Z 2025-05-07T20:26:14.4982371Z 2025-05-07T20:26:14.4982374Z 2025-05-07T20:26:14.4984364Z 2025-05-07T20:26:14.5387034Z cuda-nvvm-tools-12.8 | 23.5 MB | ####2 | 43%  2025-05-07T20:26:14.5387525Z 2025-05-07T20:26:14.5387529Z 2025-05-07T20:26:14.5387533Z 2025-05-07T20:26:14.5387560Z 2025-05-07T20:26:14.5387564Z 2025-05-07T20:26:14.5387567Z 2025-05-07T20:26:14.5387571Z 2025-05-07T20:26:14.5387574Z 2025-05-07T20:26:14.5387578Z 2025-05-07T20:26:14.5387582Z 2025-05-07T20:26:14.5387585Z 2025-05-07T20:26:14.5387596Z 2025-05-07T20:26:14.5387600Z 2025-05-07T20:26:14.5416513Z cuda-nvcc-tools-12.8 | 24.5 MB | ######8 | 69%  2025-05-07T20:26:14.5416913Z 2025-05-07T20:26:14.5416917Z 2025-05-07T20:26:14.5416921Z 2025-05-07T20:26:14.5416924Z 2025-05-07T20:26:14.5416934Z 2025-05-07T20:26:14.5416938Z 2025-05-07T20:26:14.5416946Z 2025-05-07T20:26:14.5416949Z 2025-05-07T20:26:14.5416953Z 2025-05-07T20:26:14.5416957Z 2025-05-07T20:26:14.5418221Z 2025-05-07T20:26:14.5474951Z python-3.11.8 | 29.3 MB | #########7 | 98%  2025-05-07T20:26:14.5475330Z 2025-05-07T20:26:14.5475336Z 2025-05-07T20:26:14.5475341Z 2025-05-07T20:26:14.5475346Z 2025-05-07T20:26:14.5475351Z 2025-05-07T20:26:14.5475368Z 2025-05-07T20:26:14.5475373Z 2025-05-07T20:26:14.5475378Z 2025-05-07T20:26:14.5475383Z 2025-05-07T20:26:14.5475388Z 2025-05-07T20:26:14.5475393Z 2025-05-07T20:26:14.5476831Z 2025-05-07T20:26:14.5986833Z libnvjitlink-12.8.61 | 28.7 MB | ######8 | 69%  2025-05-07T20:26:14.5987232Z 2025-05-07T20:26:14.5987239Z 2025-05-07T20:26:14.5987275Z 2025-05-07T20:26:14.5987279Z 2025-05-07T20:26:14.5987283Z 2025-05-07T20:26:14.5987286Z 2025-05-07T20:26:14.5987293Z 2025-05-07T20:26:14.5987296Z 2025-05-07T20:26:14.5987300Z 2025-05-07T20:26:14.5987303Z 2025-05-07T20:26:14.5987315Z 2025-05-07T20:26:14.5987318Z 2025-05-07T20:26:14.5987322Z 2025-05-07T20:26:14.5990085Z 2025-05-07T20:26:14.6423862Z cuda-nvvm-tools-12.8 | 23.5 MB | #####5 | 55%  2025-05-07T20:26:14.6424346Z 2025-05-07T20:26:14.6424376Z 2025-05-07T20:26:14.6424381Z 2025-05-07T20:26:14.6424386Z 2025-05-07T20:26:14.6424392Z 2025-05-07T20:26:14.6424424Z 2025-05-07T20:26:14.6424429Z 2025-05-07T20:26:14.6424435Z 2025-05-07T20:26:14.6424440Z 2025-05-07T20:26:14.6424445Z 2025-05-07T20:26:14.6424450Z 2025-05-07T20:26:14.6424456Z 2025-05-07T20:26:14.6424461Z 2025-05-07T20:26:14.6475274Z cuda-nvcc-tools-12.8 | 24.5 MB | #######9 | 80%  2025-05-07T20:26:14.6475703Z 2025-05-07T20:26:14.6475941Z 2025-05-07T20:26:14.6475946Z 2025-05-07T20:26:14.6475950Z 2025-05-07T20:26:14.6475953Z 2025-05-07T20:26:14.6475957Z 2025-05-07T20:26:14.6475961Z 2025-05-07T20:26:14.6475966Z 2025-05-07T20:26:14.6475971Z 
2025-05-07T20:26:14.6475975Z 2025-05-07T20:26:14.6475980Z 2025-05-07T20:26:14.6475984Z 2025-05-07T20:26:14.6988463Z libnvjitlink-12.8.61 | 28.7 MB | #######8 | 78%  2025-05-07T20:26:14.6988794Z 2025-05-07T20:26:14.6988798Z 2025-05-07T20:26:14.6988802Z 2025-05-07T20:26:14.6988806Z 2025-05-07T20:26:14.6988809Z 2025-05-07T20:26:14.6988813Z 2025-05-07T20:26:14.6989176Z 2025-05-07T20:26:14.6989182Z 2025-05-07T20:26:14.6989187Z 2025-05-07T20:26:14.6989192Z 2025-05-07T20:26:14.6989197Z 2025-05-07T20:26:14.6989202Z 2025-05-07T20:26:14.6989207Z 2025-05-07T20:26:14.6989213Z 2025-05-07T20:26:14.7432835Z cuda-nvvm-tools-12.8 | 23.5 MB | ######7 | 67%  2025-05-07T20:26:14.7433277Z 2025-05-07T20:26:14.7433304Z 2025-05-07T20:26:14.7433308Z 2025-05-07T20:26:14.7433312Z 2025-05-07T20:26:14.7433315Z 2025-05-07T20:26:14.7433319Z 2025-05-07T20:26:14.7433323Z 2025-05-07T20:26:14.7433326Z 2025-05-07T20:26:14.7433330Z 2025-05-07T20:26:14.7433334Z 2025-05-07T20:26:14.7433337Z 2025-05-07T20:26:14.7433342Z 2025-05-07T20:26:14.7433345Z 2025-05-07T20:26:14.7509587Z cuda-nvcc-tools-12.8 | 24.5 MB | ######### | 91%  2025-05-07T20:26:14.7509906Z 2025-05-07T20:26:14.7509910Z 2025-05-07T20:26:14.7509913Z 2025-05-07T20:26:14.7509921Z 2025-05-07T20:26:14.7509927Z 2025-05-07T20:26:14.7509944Z 2025-05-07T20:26:14.7509948Z 2025-05-07T20:26:14.7509951Z 2025-05-07T20:26:14.7509955Z 2025-05-07T20:26:14.7509959Z 2025-05-07T20:26:14.7509962Z 2025-05-07T20:26:14.7511834Z 2025-05-07T20:26:14.8066557Z libnvjitlink-12.8.61 | 28.7 MB | ########7 | 87%  2025-05-07T20:26:14.8066887Z 2025-05-07T20:26:14.8066891Z 2025-05-07T20:26:14.8066914Z 2025-05-07T20:26:14.8066918Z 2025-05-07T20:26:14.8066923Z 2025-05-07T20:26:14.8066927Z 2025-05-07T20:26:14.8066932Z 2025-05-07T20:26:14.8066936Z 2025-05-07T20:26:14.8066941Z 2025-05-07T20:26:14.8066946Z 2025-05-07T20:26:14.8066950Z 2025-05-07T20:26:14.8066955Z 2025-05-07T20:26:14.8066959Z 2025-05-07T20:26:14.8066970Z 2025-05-07T20:26:14.8515250Z cuda-nvvm-tools-12.8 | 23.5 MB | #######8 | 79%  2025-05-07T20:26:14.8515581Z 2025-05-07T20:26:14.8515585Z 2025-05-07T20:26:14.8515589Z 2025-05-07T20:26:14.8515592Z 2025-05-07T20:26:14.8515596Z 2025-05-07T20:26:14.8515632Z 2025-05-07T20:26:14.8515635Z 2025-05-07T20:26:14.8515639Z 2025-05-07T20:26:14.8515643Z 2025-05-07T20:26:14.8515648Z 2025-05-07T20:26:14.8515652Z 2025-05-07T20:26:14.8515655Z 2025-05-07T20:26:14.9068594Z libnvjitlink-12.8.61 | 28.7 MB | #########8 | 99%  2025-05-07T20:26:14.9068921Z 2025-05-07T20:26:14.9068925Z 2025-05-07T20:26:14.9068944Z 2025-05-07T20:26:14.9068948Z 2025-05-07T20:26:14.9068951Z 2025-05-07T20:26:14.9068955Z 2025-05-07T20:26:14.9068958Z 2025-05-07T20:26:14.9068962Z 2025-05-07T20:26:14.9068966Z 2025-05-07T20:26:14.9068969Z 2025-05-07T20:26:14.9068973Z 2025-05-07T20:26:14.9068976Z 2025-05-07T20:26:14.9068980Z 2025-05-07T20:26:14.9068983Z 2025-05-07T20:26:15.6520038Z cuda-nvvm-tools-12.8 | 23.5 MB | #########3 | 93%  2025-05-07T20:26:15.6520388Z 2025-05-07T20:26:15.6520392Z 2025-05-07T20:26:15.6520396Z 2025-05-07T20:26:15.6520399Z 2025-05-07T20:26:15.6520403Z 2025-05-07T20:26:15.6520438Z 2025-05-07T20:26:15.6520446Z 2025-05-07T20:26:15.6520449Z 2025-05-07T20:26:15.6520453Z 2025-05-07T20:26:15.6520457Z 2025-05-07T20:26:15.6522531Z 2025-05-07T20:26:15.6994217Z python-3.11.8 | 29.3 MB | ########## | 100%  2025-05-07T20:26:15.6994517Z 2025-05-07T20:26:15.6994521Z 2025-05-07T20:26:15.6994525Z 2025-05-07T20:26:15.6994770Z 2025-05-07T20:26:15.6994775Z 2025-05-07T20:26:15.6994779Z 2025-05-07T20:26:15.6994783Z 2025-05-07T20:26:15.6994786Z 
2025-05-07T20:26:15.6994802Z 2025-05-07T20:26:15.6994806Z 2025-05-07T20:26:15.6994809Z 2025-05-07T20:26:15.6994813Z 2025-05-07T20:26:15.6994816Z 2025-05-07T20:26:15.6994820Z 2025-05-07T20:26:15.7001269Z 2025-05-07T20:26:15.7036306Z cuda-nvvm-impl-12.8. | 20.8 MB | | 0%  2025-05-07T20:26:15.7036761Z 2025-05-07T20:26:15.7036767Z 2025-05-07T20:26:15.7036772Z 2025-05-07T20:26:15.7036777Z 2025-05-07T20:26:15.7036783Z 2025-05-07T20:26:15.7037044Z 2025-05-07T20:26:15.7037048Z 2025-05-07T20:26:15.7037051Z 2025-05-07T20:26:15.7037055Z 2025-05-07T20:26:15.7037059Z 2025-05-07T20:26:15.7037062Z 2025-05-07T20:26:15.7037066Z 2025-05-07T20:26:15.7037724Z 2025-05-07T20:26:15.7489739Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%  2025-05-07T20:26:15.7490116Z 2025-05-07T20:26:15.7490143Z 2025-05-07T20:26:15.7490147Z 2025-05-07T20:26:15.7490150Z 2025-05-07T20:26:15.7490154Z 2025-05-07T20:26:15.7490158Z 2025-05-07T20:26:15.7490161Z 2025-05-07T20:26:15.7490165Z 2025-05-07T20:26:15.7490169Z 2025-05-07T20:26:15.7490172Z 2025-05-07T20:26:15.7490185Z 2025-05-07T20:26:15.7490189Z 2025-05-07T20:26:15.7490193Z 2025-05-07T20:26:15.7490197Z 2025-05-07T20:26:15.7490200Z 2025-05-07T20:26:15.7492987Z 2025-05-07T20:26:15.7640577Z cuda-nvcc-dev_linux- | 12.7 MB | | 0%  2025-05-07T20:26:15.7641009Z 2025-05-07T20:26:15.7641015Z 2025-05-07T20:26:15.7641040Z 2025-05-07T20:26:15.7641045Z 2025-05-07T20:26:15.7641051Z 2025-05-07T20:26:15.7641056Z 2025-05-07T20:26:15.7641059Z 2025-05-07T20:26:15.7641063Z 2025-05-07T20:26:15.7641067Z 2025-05-07T20:26:15.7641070Z 2025-05-07T20:26:15.7641074Z 2025-05-07T20:26:15.7641078Z 2025-05-07T20:26:15.7641081Z 2025-05-07T20:26:15.7643343Z 2025-05-07T20:26:15.7994173Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%  2025-05-07T20:26:15.7994546Z 2025-05-07T20:26:15.7994552Z 2025-05-07T20:26:15.7994557Z 2025-05-07T20:26:15.7994562Z 2025-05-07T20:26:15.7994567Z 2025-05-07T20:26:15.7994572Z 2025-05-07T20:26:15.7994577Z 2025-05-07T20:26:15.7994583Z 2025-05-07T20:26:15.7994588Z 2025-05-07T20:26:15.7994593Z 2025-05-07T20:26:15.7994598Z 2025-05-07T20:26:15.7994604Z 2025-05-07T20:26:15.7994609Z 2025-05-07T20:26:15.7994614Z 2025-05-07T20:26:15.7996274Z 2025-05-07T20:26:15.8106636Z cuda-nvvm-impl-12.8. 
| 20.8 MB | #5 | 15%  2025-05-07T20:26:15.8107098Z 2025-05-07T20:26:15.8107104Z 2025-05-07T20:26:15.8107109Z 2025-05-07T20:26:15.8107115Z 2025-05-07T20:26:15.8107120Z 2025-05-07T20:26:15.8107126Z 2025-05-07T20:26:15.8107131Z 2025-05-07T20:26:15.8107136Z 2025-05-07T20:26:15.8107142Z 2025-05-07T20:26:15.8107147Z 2025-05-07T20:26:15.8107153Z 2025-05-07T20:26:15.8107159Z 2025-05-07T20:26:15.8107185Z 2025-05-07T20:26:15.8107192Z 2025-05-07T20:26:15.8107198Z 2025-05-07T20:26:15.8107203Z 2025-05-07T20:26:15.8109974Z 2025-05-07T20:26:15.8496842Z cuda-sanitizer-api-1 | 8.8 MB | | 0%  2025-05-07T20:26:15.8497353Z 2025-05-07T20:26:15.8497359Z 2025-05-07T20:26:15.8497364Z 2025-05-07T20:26:15.8497369Z 2025-05-07T20:26:15.8497374Z 2025-05-07T20:26:15.8497379Z 2025-05-07T20:26:15.8497384Z 2025-05-07T20:26:15.8497389Z 2025-05-07T20:26:15.8497395Z 2025-05-07T20:26:15.8497400Z 2025-05-07T20:26:15.8497405Z 2025-05-07T20:26:15.8497432Z 2025-05-07T20:26:15.8497437Z 2025-05-07T20:26:15.8497442Z 2025-05-07T20:26:15.8497447Z 2025-05-07T20:26:15.8499685Z 2025-05-07T20:26:15.8503422Z cuda-nvcc-dev_linux- | 12.7 MB | ##4 | 24%  2025-05-07T20:26:15.8503806Z 2025-05-07T20:26:15.8503814Z 2025-05-07T20:26:15.8503822Z 2025-05-07T20:26:15.8503831Z 2025-05-07T20:26:15.8504123Z 2025-05-07T20:26:15.8504130Z 2025-05-07T20:26:15.8504136Z 2025-05-07T20:26:15.8504141Z 2025-05-07T20:26:15.8504147Z 2025-05-07T20:26:15.8504162Z 2025-05-07T20:26:15.8504167Z 2025-05-07T20:26:15.8506378Z 2025-05-07T20:26:15.9068889Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%  2025-05-07T20:26:15.9069305Z 2025-05-07T20:26:15.9069311Z 2025-05-07T20:26:15.9069315Z 2025-05-07T20:26:15.9069319Z 2025-05-07T20:26:15.9069325Z 2025-05-07T20:26:15.9069329Z 2025-05-07T20:26:15.9069333Z 2025-05-07T20:26:15.9069337Z 2025-05-07T20:26:15.9069342Z 2025-05-07T20:26:15.9069587Z 2025-05-07T20:26:15.9069592Z 2025-05-07T20:26:15.9069596Z 2025-05-07T20:26:15.9069600Z 2025-05-07T20:26:15.9069603Z 2025-05-07T20:26:15.9069652Z 2025-05-07T20:26:15.9107731Z cuda-nvvm-impl-12.8. 
| 20.8 MB | ### | 30%  2025-05-07T20:26:15.9108062Z 2025-05-07T20:26:15.9108066Z 2025-05-07T20:26:15.9108070Z 2025-05-07T20:26:15.9108085Z 2025-05-07T20:26:15.9108089Z 2025-05-07T20:26:15.9108093Z 2025-05-07T20:26:15.9108097Z 2025-05-07T20:26:15.9108100Z 2025-05-07T20:26:15.9108104Z 2025-05-07T20:26:15.9108108Z 2025-05-07T20:26:15.9108111Z 2025-05-07T20:26:15.9108121Z 2025-05-07T20:26:15.9108125Z 2025-05-07T20:26:15.9108129Z 2025-05-07T20:26:15.9108132Z 2025-05-07T20:26:15.9108136Z 2025-05-07T20:26:15.9108139Z 2025-05-07T20:26:15.9127997Z cuda-sanitizer-api-1 | 8.8 MB | ### | 31%  2025-05-07T20:26:15.9128639Z 2025-05-07T20:26:15.9128643Z 2025-05-07T20:26:15.9128655Z 2025-05-07T20:26:15.9128659Z 2025-05-07T20:26:15.9128662Z 2025-05-07T20:26:15.9128666Z 2025-05-07T20:26:15.9128669Z 2025-05-07T20:26:15.9128673Z 2025-05-07T20:26:15.9128676Z 2025-05-07T20:26:15.9128680Z 2025-05-07T20:26:15.9128683Z 2025-05-07T20:26:15.9128687Z 2025-05-07T20:26:15.9128690Z 2025-05-07T20:26:15.9128694Z 2025-05-07T20:26:15.9128697Z 2025-05-07T20:26:15.9128706Z 2025-05-07T20:26:15.9128709Z 2025-05-07T20:26:15.9132256Z 2025-05-07T20:26:15.9561185Z cuda-nvdisasm-12.8.5 | 4.9 MB | | 0%  2025-05-07T20:26:15.9561533Z 2025-05-07T20:26:15.9561537Z 2025-05-07T20:26:15.9561540Z 2025-05-07T20:26:15.9561544Z 2025-05-07T20:26:15.9561548Z 2025-05-07T20:26:15.9561559Z 2025-05-07T20:26:15.9561562Z 2025-05-07T20:26:15.9561566Z 2025-05-07T20:26:15.9561570Z 2025-05-07T20:26:15.9561573Z 2025-05-07T20:26:15.9561577Z 2025-05-07T20:26:15.9561580Z 2025-05-07T20:26:15.9561585Z 2025-05-07T20:26:15.9561589Z 2025-05-07T20:26:15.9561607Z 2025-05-07T20:26:15.9561611Z 2025-05-07T20:26:16.0124161Z cuda-nvcc-dev_linux- | 12.7 MB | ####8 | 49%  2025-05-07T20:26:16.0124521Z 2025-05-07T20:26:16.0124525Z 2025-05-07T20:26:16.0124528Z 2025-05-07T20:26:16.0124532Z 2025-05-07T20:26:16.0124536Z 2025-05-07T20:26:16.0124539Z 2025-05-07T20:26:16.0124543Z 2025-05-07T20:26:16.0124567Z 2025-05-07T20:26:16.0124571Z 2025-05-07T20:26:16.0124574Z 2025-05-07T20:26:16.0124578Z 2025-05-07T20:26:16.0124582Z 2025-05-07T20:26:16.0124585Z 2025-05-07T20:26:16.0124589Z 2025-05-07T20:26:16.0124592Z 2025-05-07T20:26:16.0124596Z 2025-05-07T20:26:16.0125987Z 2025-05-07T20:26:16.0132819Z cuda-sanitizer-api-1 | 8.8 MB | ###### | 61%  2025-05-07T20:26:16.0133155Z 2025-05-07T20:26:16.0133159Z 2025-05-07T20:26:16.0133170Z 2025-05-07T20:26:16.0133174Z 2025-05-07T20:26:16.0133178Z 2025-05-07T20:26:16.0133181Z 2025-05-07T20:26:16.0133196Z 2025-05-07T20:26:16.0133200Z 2025-05-07T20:26:16.0133203Z 2025-05-07T20:26:16.0133207Z 2025-05-07T20:26:16.0133211Z 2025-05-07T20:26:16.0133214Z 2025-05-07T20:26:16.0133218Z 2025-05-07T20:26:16.0133221Z 2025-05-07T20:26:16.0133225Z 2025-05-07T20:26:16.0133229Z 2025-05-07T20:26:16.0133232Z 2025-05-07T20:26:16.0133236Z 2025-05-07T20:26:16.0273014Z cuda-nvdisasm-12.8.5 | 4.9 MB | ###5 | 36%  2025-05-07T20:26:16.0273484Z 2025-05-07T20:26:16.0273489Z 2025-05-07T20:26:16.0273493Z 2025-05-07T20:26:16.0273497Z 2025-05-07T20:26:16.0273500Z 2025-05-07T20:26:16.0273504Z 2025-05-07T20:26:16.0273508Z 2025-05-07T20:26:16.0273511Z 2025-05-07T20:26:16.0273515Z 2025-05-07T20:26:16.0273519Z 2025-05-07T20:26:16.0273522Z 2025-05-07T20:26:16.0273534Z 2025-05-07T20:26:16.0273538Z 2025-05-07T20:26:16.0273542Z 2025-05-07T20:26:16.0276604Z 2025-05-07T20:26:16.0770054Z cuda-nvvm-impl-12.8. 
| 20.8 MB | ####4 | 45%  2025-05-07T20:26:16.0770640Z 2025-05-07T20:26:16.0770644Z 2025-05-07T20:26:16.0770648Z 2025-05-07T20:26:16.0770652Z 2025-05-07T20:26:16.0770655Z 2025-05-07T20:26:16.0770659Z 2025-05-07T20:26:16.0770663Z 2025-05-07T20:26:16.0770666Z 2025-05-07T20:26:16.0770670Z 2025-05-07T20:26:16.0770674Z 2025-05-07T20:26:16.0770677Z 2025-05-07T20:26:16.0770681Z 2025-05-07T20:26:16.0770694Z 2025-05-07T20:26:16.0770698Z 2025-05-07T20:26:16.0770702Z 2025-05-07T20:26:16.0770705Z 2025-05-07T20:26:16.1125365Z cuda-nvcc-dev_linux- | 12.7 MB | #######1 | 72%  2025-05-07T20:26:16.1125702Z 2025-05-07T20:26:16.1125706Z 2025-05-07T20:26:16.1125710Z 2025-05-07T20:26:16.1125713Z 2025-05-07T20:26:16.1125717Z 2025-05-07T20:26:16.1125720Z 2025-05-07T20:26:16.1125724Z 2025-05-07T20:26:16.1125728Z 2025-05-07T20:26:16.1125732Z 2025-05-07T20:26:16.1125736Z 2025-05-07T20:26:16.1125739Z 2025-05-07T20:26:16.1125753Z 2025-05-07T20:26:16.1125770Z 2025-05-07T20:26:16.1125773Z 2025-05-07T20:26:16.1125777Z 2025-05-07T20:26:16.1125780Z 2025-05-07T20:26:16.1127421Z 2025-05-07T20:26:16.1143613Z cuda-sanitizer-api-1 | 8.8 MB | #########3 | 93%  2025-05-07T20:26:16.1144011Z 2025-05-07T20:26:16.1144015Z 2025-05-07T20:26:16.1144018Z 2025-05-07T20:26:16.1144022Z 2025-05-07T20:26:16.1144038Z 2025-05-07T20:26:16.1144042Z 2025-05-07T20:26:16.1144045Z 2025-05-07T20:26:16.1144049Z 2025-05-07T20:26:16.1144052Z 2025-05-07T20:26:16.1144056Z 2025-05-07T20:26:16.1144060Z 2025-05-07T20:26:16.1144063Z 2025-05-07T20:26:16.1144067Z 2025-05-07T20:26:16.1144070Z 2025-05-07T20:26:16.1144074Z 2025-05-07T20:26:16.1144077Z 2025-05-07T20:26:16.1144081Z 2025-05-07T20:26:16.1145333Z 2025-05-07T20:26:16.1272536Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########9 | 90%  2025-05-07T20:26:16.1272887Z 2025-05-07T20:26:16.1272891Z 2025-05-07T20:26:16.1272904Z 2025-05-07T20:26:16.1272908Z 2025-05-07T20:26:16.1272911Z 2025-05-07T20:26:16.1272925Z 2025-05-07T20:26:16.1272928Z 2025-05-07T20:26:16.1272932Z 2025-05-07T20:26:16.1272936Z 2025-05-07T20:26:16.1272939Z 2025-05-07T20:26:16.1272943Z 2025-05-07T20:26:16.1272947Z 2025-05-07T20:26:16.1272950Z 2025-05-07T20:26:16.1272954Z 2025-05-07T20:26:16.1272957Z 2025-05-07T20:26:16.1784339Z cuda-nvvm-impl-12.8. | 20.8 MB | #####8 | 59%  2025-05-07T20:26:16.1784692Z 2025-05-07T20:26:16.1784697Z 2025-05-07T20:26:16.1784702Z 2025-05-07T20:26:16.1784707Z 2025-05-07T20:26:16.1784713Z 2025-05-07T20:26:16.1784718Z 2025-05-07T20:26:16.1784723Z 2025-05-07T20:26:16.1784728Z 2025-05-07T20:26:16.1784733Z 2025-05-07T20:26:16.1784739Z 2025-05-07T20:26:16.1784744Z 2025-05-07T20:26:16.1784749Z 2025-05-07T20:26:16.1784754Z 2025-05-07T20:26:16.1784759Z 2025-05-07T20:26:16.1784764Z 2025-05-07T20:26:16.1784958Z 2025-05-07T20:26:16.2279549Z cuda-nvcc-dev_linux- | 12.7 MB | #########3 | 93%  2025-05-07T20:26:16.2279907Z 2025-05-07T20:26:16.2279911Z 2025-05-07T20:26:16.2279915Z 2025-05-07T20:26:16.2279919Z 2025-05-07T20:26:16.2279930Z 2025-05-07T20:26:16.2279934Z 2025-05-07T20:26:16.2279938Z 2025-05-07T20:26:16.2279941Z 2025-05-07T20:26:16.2279945Z 2025-05-07T20:26:16.2279948Z 2025-05-07T20:26:16.2280171Z 2025-05-07T20:26:16.2280176Z 2025-05-07T20:26:16.2280179Z 2025-05-07T20:26:16.2280183Z 2025-05-07T20:26:16.2280187Z 2025-05-07T20:26:16.3114909Z cuda-nvvm-impl-12.8. 
| 20.8 MB | #######3 | 73%  2025-05-07T20:26:16.3115259Z 2025-05-07T20:26:16.3115264Z 2025-05-07T20:26:16.3115267Z 2025-05-07T20:26:16.3115271Z 2025-05-07T20:26:16.3115275Z 2025-05-07T20:26:16.3115278Z 2025-05-07T20:26:16.3115282Z 2025-05-07T20:26:16.3115286Z 2025-05-07T20:26:16.3115289Z 2025-05-07T20:26:16.3115293Z 2025-05-07T20:26:16.3115297Z 2025-05-07T20:26:16.3115300Z 2025-05-07T20:26:16.3115527Z 2025-05-07T20:26:16.3115530Z 2025-05-07T20:26:16.3115534Z 2025-05-07T20:26:16.3115538Z 2025-05-07T20:26:16.3115541Z 2025-05-07T20:26:16.3117186Z 2025-05-07T20:26:16.3286719Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%  2025-05-07T20:26:16.3287124Z 2025-05-07T20:26:16.3287137Z 2025-05-07T20:26:16.3287160Z 2025-05-07T20:26:16.3287165Z 2025-05-07T20:26:16.3287169Z 2025-05-07T20:26:16.3287172Z 2025-05-07T20:26:16.3287176Z 2025-05-07T20:26:16.3287180Z 2025-05-07T20:26:16.3287183Z 2025-05-07T20:26:16.3287187Z 2025-05-07T20:26:16.3287191Z 2025-05-07T20:26:16.3287194Z 2025-05-07T20:26:16.3287198Z 2025-05-07T20:26:16.3287201Z 2025-05-07T20:26:16.3287205Z 2025-05-07T20:26:16.3721399Z cuda-nvvm-impl-12.8. | 20.8 MB | ########9 | 89%  2025-05-07T20:26:16.3721737Z 2025-05-07T20:26:16.3721741Z 2025-05-07T20:26:16.3721744Z 2025-05-07T20:26:16.3721748Z 2025-05-07T20:26:16.3721763Z 2025-05-07T20:26:16.3721766Z 2025-05-07T20:26:16.3721770Z 2025-05-07T20:26:16.3721773Z 2025-05-07T20:26:16.3721777Z 2025-05-07T20:26:16.3721780Z 2025-05-07T20:26:16.3721784Z 2025-05-07T20:26:16.3721787Z 2025-05-07T20:26:16.3721799Z 2025-05-07T20:26:16.3721803Z 2025-05-07T20:26:16.3721807Z 2025-05-07T20:26:16.3721810Z 2025-05-07T20:26:16.3721814Z 2025-05-07T20:26:16.3721829Z 2025-05-07T20:26:16.3723259Z 2025-05-07T20:26:16.4596022Z ... (more hidden) ... 2025-05-07T20:26:16.4596313Z 2025-05-07T20:26:16.4596317Z 2025-05-07T20:26:16.4596321Z 2025-05-07T20:26:16.4596324Z 2025-05-07T20:26:16.4596328Z 2025-05-07T20:26:16.4596332Z 2025-05-07T20:26:16.4596336Z 2025-05-07T20:26:16.4596349Z 2025-05-07T20:26:16.4596353Z 2025-05-07T20:26:16.4596356Z 2025-05-07T20:26:16.4596360Z 2025-05-07T20:26:16.4596363Z 2025-05-07T20:26:16.4596367Z 2025-05-07T20:26:16.4596370Z 2025-05-07T20:26:16.4596374Z 2025-05-07T20:26:16.4596377Z 2025-05-07T20:26:16.4600930Z 2025-05-07T20:26:16.4728764Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%  2025-05-07T20:26:16.4729128Z 2025-05-07T20:26:16.4729132Z 2025-05-07T20:26:16.4729136Z 2025-05-07T20:26:16.4729139Z 2025-05-07T20:26:16.4729143Z 2025-05-07T20:26:16.4729147Z 2025-05-07T20:26:16.4729150Z 2025-05-07T20:26:16.4729168Z 2025-05-07T20:26:16.4729171Z 2025-05-07T20:26:16.4729175Z 2025-05-07T20:26:16.4729178Z 2025-05-07T20:26:16.4729182Z 2025-05-07T20:26:16.4729185Z 2025-05-07T20:26:16.4729189Z 2025-05-07T20:26:16.4729192Z 2025-05-07T20:26:16.4729196Z 2025-05-07T20:26:16.4729208Z 2025-05-07T20:26:16.4729212Z 2025-05-07T20:26:16.4729216Z 2025-05-07T20:26:16.6613131Z ... (more hidden) ... 
2025-05-07T20:26:16.6613442Z 2025-05-07T20:26:16.6613455Z 2025-05-07T20:26:16.6613460Z 2025-05-07T20:26:16.6613463Z 2025-05-07T20:26:16.6613467Z 2025-05-07T20:26:16.6613470Z 2025-05-07T20:26:16.6613498Z 2025-05-07T20:26:16.6613502Z 2025-05-07T20:26:16.6613506Z 2025-05-07T20:26:16.6613509Z 2025-05-07T20:26:16.6613513Z 2025-05-07T20:26:16.6613517Z 2025-05-07T20:26:16.6613521Z 2025-05-07T20:26:16.6613524Z 2025-05-07T20:26:16.6613528Z 2025-05-07T20:26:16.6614581Z 2025-05-07T20:26:16.6716310Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%  2025-05-07T20:26:16.6716648Z 2025-05-07T20:26:16.6716652Z 2025-05-07T20:26:16.6716656Z 2025-05-07T20:26:16.6716659Z 2025-05-07T20:26:16.6716663Z 2025-05-07T20:26:16.6716666Z 2025-05-07T20:26:16.6716670Z 2025-05-07T20:26:16.6716673Z 2025-05-07T20:26:16.6716677Z 2025-05-07T20:26:16.6716680Z 2025-05-07T20:26:16.6716684Z 2025-05-07T20:26:16.6716695Z 2025-05-07T20:26:16.6716699Z 2025-05-07T20:26:16.6716702Z 2025-05-07T20:26:16.6716706Z 2025-05-07T20:26:16.6716709Z 2025-05-07T20:26:16.6716713Z 2025-05-07T20:26:16.6716717Z 2025-05-07T20:26:16.6716720Z 2025-05-07T20:26:17.0603747Z ... (more hidden) ... 2025-05-07T20:26:17.0604069Z 2025-05-07T20:26:17.0604073Z 2025-05-07T20:26:17.0604077Z 2025-05-07T20:26:17.0604090Z 2025-05-07T20:26:17.0604094Z 2025-05-07T20:26:17.0604097Z 2025-05-07T20:26:17.0604101Z 2025-05-07T20:26:17.0604104Z 2025-05-07T20:26:17.0604109Z 2025-05-07T20:26:17.0604113Z 2025-05-07T20:26:17.0604145Z 2025-05-07T20:26:17.0604149Z 2025-05-07T20:26:17.0604152Z 2025-05-07T20:26:17.0604156Z 2025-05-07T20:26:17.0606587Z 2025-05-07T20:26:17.6596611Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%  2025-05-07T20:26:17.6597125Z 2025-05-07T20:26:17.6597131Z 2025-05-07T20:26:17.6597137Z 2025-05-07T20:26:17.6597142Z 2025-05-07T20:26:17.6597148Z 2025-05-07T20:26:17.6597153Z 2025-05-07T20:26:17.6597159Z 2025-05-07T20:26:17.6597164Z 2025-05-07T20:26:17.6599164Z 2025-05-07T20:26:18.5274511Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%  2025-05-07T20:26:18.5274860Z 2025-05-07T20:26:18.5274882Z 2025-05-07T20:26:18.5274886Z 2025-05-07T20:26:18.5274890Z 2025-05-07T20:26:18.5274893Z 2025-05-07T20:26:18.5274897Z 2025-05-07T20:26:18.5274900Z 2025-05-07T20:26:18.5274913Z 2025-05-07T20:26:18.5274917Z 2025-05-07T20:26:18.5274920Z 2025-05-07T20:26:18.7549001Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%  2025-05-07T20:26:18.7549407Z 2025-05-07T20:26:18.7549423Z 2025-05-07T20:26:18.7549427Z 2025-05-07T20:26:18.7549431Z 2025-05-07T20:26:18.7549435Z 2025-05-07T20:26:18.7549438Z 2025-05-07T20:26:18.7553368Z 2025-05-07T20:26:19.1835732Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%  2025-05-07T20:26:19.4097971Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:19.4098292Z 2025-05-07T20:26:19.4098364Z 2025-05-07T20:26:19.4098369Z 2025-05-07T20:26:19.4098387Z 2025-05-07T20:26:19.4098428Z 2025-05-07T20:26:19.9438537Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%  2025-05-07T20:26:19.9438972Z 2025-05-07T20:26:19.9438978Z 2025-05-07T20:26:19.9438983Z 2025-05-07T20:26:19.9438989Z 2025-05-07T20:26:19.9439002Z 2025-05-07T20:26:19.9439006Z 2025-05-07T20:26:19.9439010Z 2025-05-07T20:26:19.9439014Z 2025-05-07T20:26:20.4957323Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%  2025-05-07T20:26:20.4957715Z 2025-05-07T20:26:20.4957722Z 2025-05-07T20:26:20.4957742Z 2025-05-07T20:26:20.4957748Z 2025-05-07T20:26:20.4957753Z 2025-05-07T20:26:20.4957758Z 2025-05-07T20:26:20.4957764Z 2025-05-07T20:26:20.4957769Z 2025-05-07T20:26:20.4957774Z 
2025-05-07T20:26:20.4957779Z 2025-05-07T20:26:20.4957784Z 2025-05-07T20:26:20.4957790Z 2025-05-07T20:26:20.4957795Z 2025-05-07T20:26:20.8592872Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%  2025-05-07T20:26:20.8593201Z 2025-05-07T20:26:20.8593205Z 2025-05-07T20:26:20.8593209Z 2025-05-07T20:26:20.8593242Z 2025-05-07T20:26:20.8593245Z 2025-05-07T20:26:20.8593249Z 2025-05-07T20:26:20.8593252Z 2025-05-07T20:26:20.8593256Z 2025-05-07T20:26:20.8593268Z 2025-05-07T20:26:20.8593272Z 2025-05-07T20:26:20.8593279Z 2025-05-07T20:26:20.9518204Z python-3.11.8 | 29.3 MB | ########## | 100%  2025-05-07T20:26:20.9518518Z 2025-05-07T20:26:20.9518839Z 2025-05-07T20:26:20.9518845Z 2025-05-07T20:26:20.9518849Z 2025-05-07T20:26:20.9518854Z 2025-05-07T20:26:20.9518858Z 2025-05-07T20:26:20.9518863Z 2025-05-07T20:26:20.9518866Z 2025-05-07T20:26:20.9518870Z 2025-05-07T20:26:20.9518873Z 2025-05-07T20:26:20.9518877Z 2025-05-07T20:26:20.9518880Z 2025-05-07T20:26:20.9518884Z 2025-05-07T20:26:20.9518888Z 2025-05-07T20:26:20.9833377Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%  2025-05-07T20:26:20.9833732Z 2025-05-07T20:26:20.9833736Z 2025-05-07T20:26:20.9833739Z 2025-05-07T20:26:20.9833743Z 2025-05-07T20:26:20.9834008Z 2025-05-07T20:26:20.9834011Z 2025-05-07T20:26:20.9834015Z 2025-05-07T20:26:20.9834018Z 2025-05-07T20:26:20.9834022Z 2025-05-07T20:26:20.9834026Z 2025-05-07T20:26:20.9834029Z 2025-05-07T20:26:20.9834033Z 2025-05-07T20:26:20.9834047Z 2025-05-07T20:26:20.9834051Z 2025-05-07T20:26:20.9834054Z 2025-05-07T20:26:20.9834058Z 2025-05-07T20:26:20.9834061Z 2025-05-07T20:26:20.9834075Z 2025-05-07T20:26:21.1738677Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%  2025-05-07T20:26:21.1739027Z 2025-05-07T20:26:21.1739030Z 2025-05-07T20:26:21.1739034Z 2025-05-07T20:26:21.1739038Z 2025-05-07T20:26:21.1739041Z 2025-05-07T20:26:21.1739045Z 2025-05-07T20:26:21.1739048Z 2025-05-07T20:26:21.1739052Z 2025-05-07T20:26:21.1739055Z 2025-05-07T20:26:21.1739059Z 2025-05-07T20:26:21.1739062Z 2025-05-07T20:26:21.1739066Z 2025-05-07T20:26:21.1739070Z 2025-05-07T20:26:21.1739073Z 2025-05-07T20:26:21.1739077Z 2025-05-07T20:26:21.1739080Z 2025-05-07T20:26:21.1739115Z 2025-05-07T20:26:21.3433556Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%  2025-05-07T20:26:21.3433932Z 2025-05-07T20:26:21.3433936Z 2025-05-07T20:26:21.3433939Z 2025-05-07T20:26:21.3433943Z 2025-05-07T20:26:21.3433946Z 2025-05-07T20:26:21.3433950Z 2025-05-07T20:26:21.3433954Z 2025-05-07T20:26:21.3433958Z 2025-05-07T20:26:21.3433991Z 2025-05-07T20:26:21.3433995Z 2025-05-07T20:26:21.3433999Z 2025-05-07T20:26:21.3434006Z 2025-05-07T20:26:21.4861599Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%  2025-05-07T20:26:21.4862051Z 2025-05-07T20:26:21.4862056Z 2025-05-07T20:26:21.4862059Z 2025-05-07T20:26:21.4862063Z 2025-05-07T20:26:21.4862066Z 2025-05-07T20:26:21.4862078Z 2025-05-07T20:26:21.4862082Z 2025-05-07T20:26:21.4862085Z 2025-05-07T20:26:21.4862089Z 2025-05-07T20:26:21.4862092Z 2025-05-07T20:26:21.4862096Z 2025-05-07T20:26:21.4862100Z 2025-05-07T20:26:21.4862103Z 2025-05-07T20:26:21.4862136Z 2025-05-07T20:26:21.4862140Z 2025-05-07T20:26:21.4862150Z 2025-05-07T20:26:21.5075971Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%  2025-05-07T20:26:21.5076411Z 2025-05-07T20:26:21.5076415Z 2025-05-07T20:26:21.5076419Z 2025-05-07T20:26:21.5076431Z 2025-05-07T20:26:21.5076435Z 2025-05-07T20:26:21.5076459Z 2025-05-07T20:26:21.5076463Z 2025-05-07T20:26:21.5076466Z 2025-05-07T20:26:21.5076470Z 2025-05-07T20:26:21.5076473Z 2025-05-07T20:26:21.5076484Z 
2025-05-07T20:26:21.5076487Z 2025-05-07T20:26:21.5076491Z 2025-05-07T20:26:21.5076494Z 2025-05-07T20:26:21.5076498Z 2025-05-07T20:26:21.5076501Z 2025-05-07T20:26:21.5076505Z 2025-05-07T20:26:21.5076509Z 2025-05-07T20:26:21.5076513Z 2025-05-07T20:26:22.3449785Z ... (more hidden) ... 2025-05-07T20:26:22.3450107Z 2025-05-07T20:26:22.3450111Z 2025-05-07T20:26:22.3450125Z 2025-05-07T20:26:22.3450130Z 2025-05-07T20:26:22.3450161Z 2025-05-07T20:26:22.3450165Z 2025-05-07T20:26:22.3450169Z 2025-05-07T20:26:22.3450173Z 2025-05-07T20:26:22.3450177Z 2025-05-07T20:26:22.3450181Z 2025-05-07T20:26:22.3450185Z 2025-05-07T20:26:22.3450188Z 2025-05-07T20:26:22.3450192Z 2025-05-07T20:26:22.3450196Z 2025-05-07T20:26:22.3450199Z 2025-05-07T20:26:26.1732816Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%  2025-05-07T20:26:26.1733162Z 2025-05-07T20:26:27.4367372Z nsight-compute-2025. | 320.6 MB | ########## | 100%  2025-05-07T20:26:27.4375277Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100% 2025-05-07T20:26:27.4375544Z 2025-05-07T20:26:27.4375549Z 2025-05-07T20:26:27.4375553Z 2025-05-07T20:26:27.4375557Z 2025-05-07T20:26:27.4375564Z 2025-05-07T20:26:27.4375568Z 2025-05-07T20:26:27.4375572Z 2025-05-07T20:26:27.4375577Z 2025-05-07T20:26:27.4375581Z 2025-05-07T20:26:27.4375585Z 2025-05-07T20:26:27.4375588Z 2025-05-07T20:26:27.4375592Z 2025-05-07T20:26:27.4375816Z 2025-05-07T20:26:27.4375820Z 2025-05-07T20:26:27.4375823Z 2025-05-07T20:26:27.4375827Z 2025-05-07T20:26:27.4375831Z 2025-05-07T20:26:27.4375834Z 2025-05-07T20:26:27.4375838Z 2025-05-07T20:26:27.4375939Z 2025-05-07T20:26:27.4376269Z  2025-05-07T20:26:27.4376594Z 2025-05-07T20:26:27.4376795Z 2025-05-07T20:26:27.4376979Z  2025-05-07T20:26:27.4377176Z 2025-05-07T20:26:27.4377181Z 2025-05-07T20:26:27.4377356Z  2025-05-07T20:26:27.4377559Z 2025-05-07T20:26:27.4377563Z 2025-05-07T20:26:27.4377567Z 2025-05-07T20:26:27.4377750Z  2025-05-07T20:26:27.4377956Z 2025-05-07T20:26:27.4377960Z 2025-05-07T20:26:27.4377964Z 2025-05-07T20:26:27.4377967Z 2025-05-07T20:26:27.4378154Z  2025-05-07T20:26:27.4378371Z 2025-05-07T20:26:27.4378374Z 2025-05-07T20:26:27.4378378Z 2025-05-07T20:26:27.4378381Z 2025-05-07T20:26:27.4378385Z 2025-05-07T20:26:27.4378829Z  2025-05-07T20:26:27.4379076Z 2025-05-07T20:26:27.4379080Z 2025-05-07T20:26:27.4379110Z 2025-05-07T20:26:27.4379114Z 2025-05-07T20:26:27.4379118Z 2025-05-07T20:26:27.4379121Z 2025-05-07T20:26:27.4379311Z  2025-05-07T20:26:27.4379531Z 2025-05-07T20:26:27.4379535Z 2025-05-07T20:26:27.4379539Z 2025-05-07T20:26:27.4379542Z 2025-05-07T20:26:27.4379546Z 2025-05-07T20:26:27.4379549Z 2025-05-07T20:26:27.4379553Z 2025-05-07T20:26:27.4379867Z  2025-05-07T20:26:27.4380173Z 2025-05-07T20:26:27.4380188Z 2025-05-07T20:26:27.4380194Z 2025-05-07T20:26:27.4380201Z 2025-05-07T20:26:27.4380216Z 2025-05-07T20:26:27.4380222Z 2025-05-07T20:26:27.4380227Z 2025-05-07T20:26:27.4380232Z 2025-05-07T20:26:27.4380513Z  2025-05-07T20:26:27.4380746Z 2025-05-07T20:26:27.4380756Z 2025-05-07T20:26:27.4380760Z 2025-05-07T20:26:27.4380763Z 2025-05-07T20:26:27.4380767Z 2025-05-07T20:26:27.4380776Z 2025-05-07T20:26:27.4380780Z 2025-05-07T20:26:27.4380783Z 2025-05-07T20:26:27.4380787Z 2025-05-07T20:26:27.4381015Z  2025-05-07T20:26:27.4381241Z 2025-05-07T20:26:27.4381245Z 2025-05-07T20:26:27.4381248Z 2025-05-07T20:26:27.4381252Z 2025-05-07T20:26:27.4381255Z 2025-05-07T20:26:27.4381259Z 2025-05-07T20:26:27.4381262Z 2025-05-07T20:26:27.4381266Z 2025-05-07T20:26:27.4381270Z 2025-05-07T20:26:27.4381273Z 2025-05-07T20:26:27.4382103Z  2025-05-07T20:26:27.4382404Z 
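[NOTE] The progress lines above are conda downloading the CUDA 12.8 toolchain and Python 3.11 into the build_binary environment; part of the package list is hidden by the log. As a rough sketch, an equivalent one-shot install would look like the following (channel names, the --quiet flag, and the lack of version pins are assumptions, not taken from this log, and the packages whose names are truncated above are left out):

# Sketch only: install the CUDA 12.8 build toolchain into the existing env.
# The list covers just the packages visible in the log; the
# "... (more hidden) ..." marker means the real transaction installed more.
conda install -n build_binary -y --quiet \
    -c conda-forge -c nvidia \
    python=3.11 \
    cuda-nvcc-tools cuda-nvvm-tools cuda-nvdisasm \
    cuda-sanitizer-api libnvjitlink nsight-compute libcublas

Resolving everything in a single conda transaction like this is what produces the "Preparing/Verifying/Executing transaction" lines that follow.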
2025-05-07T20:26:27.4438717Z done
2025-05-07T20:26:27.7647803Z Preparing transaction: done
2025-05-07T20:26:32.2630577Z Verifying transaction: done
2025-05-07T20:26:33.0743025Z Executing transaction: done
2025-05-07T20:26:35.4637459Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ...
2025-05-07T20:26:35.4637893Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:35.4638604Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:35.4651880Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:35.4664135Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:35.4669307Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:35.6233409Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:35.6256008Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:35.6633507Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:37.5562891Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:37.6204062Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:38.0504284Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:38.0852908Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:38.5210960Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:38.5212220Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:40.9855536Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:43.0270466Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:45.0663509Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:45.0665090Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:47.0948825Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:48.9952020Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:49.0577103Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:52.9192323Z /tmp/tmpkx9dulct: line 3: clang: command not found
2025-05-07T20:26:52.9192913Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:52.9821069Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:52.9843879Z total 36
2025-05-07T20:26:52.9844269Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:52.9844798Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:26:52.9845380Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:52.9845919Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:52.9846478Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:52.9847121Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:52.9847739Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:52.9848346Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:26:52.9848881Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:52.9849511Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:52.9870356Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:54.9368367Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:54.9368907Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:55.3662064Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:57.2572051Z -allow-unsupported-compiler
2025-05-07T20:26:57.3205400Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:26:57.3205913Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:26:59.2922241Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:26:59.2923002Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:26:59.2923410Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:26:59.2923847Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:26:59.2924251Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:26:59.2924518Z #define _STL_PAIR_H 1
2025-05-07T20:26:59.2924771Z #define __cpp_attributes 200809L
2025-05-07T20:26:59.2925177Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:26:59.2925572Z #define __DELETE_THROW throw()
2025-05-07T20:26:59.2926186Z #define _PTRDIFF_T_
2025-05-07T20:26:59.2926575Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:26:59.2926973Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:26:59.2927335Z #define _IO_LEFT 02
2025-05-07T20:26:59.2927631Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:26:59.2927919Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:26:59.2928513Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:26:59.2929092Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:26:59.2929558Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:26:59.2929834Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:26:59.2930290Z #define _IOS_OUTPUT 2
2025-05-07T20:26:59.2930586Z #define __SM_100_RT_HPP__
2025-05-07T20:26:59.2930995Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:26:59.2931505Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:26:59.2932056Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:26:59.2932549Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:26:59.2933042Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:26:59.2934295Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:26:59.2935456Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:26:59.2935883Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:26:59.2936280Z #define
cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:59.2936587Z #define _T_WCHAR_ 2025-05-07T20:26:59.2936806Z #define stdout stdout 2025-05-07T20:26:59.2937136Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:59.2937511Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:59.2937765Z #define __flexarr [] 2025-05-07T20:26:59.2937998Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:59.2938317Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:59.2938664Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:59.2938908Z #define _MATH_H 1 2025-05-07T20:26:59.2939184Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:59.2939519Z #define __S64_TYPE long int 2025-05-07T20:26:59.2939776Z #define __stub_fchflags 2025-05-07T20:26:59.2940033Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:59.2940323Z #define __SQUAD_TYPE long int 2025-05-07T20:26:59.2940585Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:59.2940882Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:59.2941216Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:59.2941471Z #define NL_NMAX INT_MAX 2025-05-07T20:26:59.2941700Z #define _BITS_TIME_H 1 2025-05-07T20:26:59.2941974Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:59.2942299Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:59.2942593Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:59.2942941Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:59.2943338Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:59.2943691Z #define __CHAR_BIT__ 8 2025-05-07T20:26:59.2943952Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.2944265Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:59.2944559Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:59.2944816Z #define FP_NAN 0 2025-05-07T20:26:59.2945077Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:59.2945481Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:59.2945880Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:59.2955681Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:59.2955969Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:59.2956228Z #define __SM_80_RT_H__ 2025-05-07T20:26:59.2956458Z #define _NEW 2025-05-07T20:26:59.2956681Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:59.2956965Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:59.2957587Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:59.2957992Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:59.2958234Z #define __USE_ANSI 1 2025-05-07T20:26:59.2958519Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:59.2958898Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:59.2959252Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:59.2959549Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:59.2959818Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:59.2960098Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:59.2960374Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:59.2960648Z #define PIPE_BUF 4096 2025-05-07T20:26:59.2961075Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:59.2961525Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:59.2961887Z #define ADJ_TICK 0x4000 2025-05-07T20:26:59.2962188Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 
2025-05-07T20:26:59.2962528Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:59.2962798Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:59.2963106Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:59.2963560Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:59.2964075Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:59.2964431Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:59.2964690Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:59.2964964Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:59.2965242Z #define __cpp_static_assert 201411L 2025-05-07T20:26:59.2965523Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:59.2965797Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:59.2966063Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:59.2966338Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:59.2966637Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:59.2966911Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:59.2967205Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.2967563Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:59.2967900Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:59.2968172Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:59.2968481Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.2968832Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:59.2969174Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:59.2969466Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:59.2969755Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:59.2970069Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:59.2970395Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:59.2970789Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:59.2971193Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:59.2971484Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:59.2971755Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:59.2972040Z #define __GCC_IEC_559 2 2025-05-07T20:26:59.2972366Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:59.2972696Z #define _IO_flockfile(_fp) 2025-05-07T20:26:59.2972958Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:59.2973220Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:59.2973484Z #define _IOFBF 0 2025-05-07T20:26:59.2973695Z #define __USE_BSD 1 2025-05-07T20:26:59.2973915Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:59.2974185Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:59.2974459Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:59.2974719Z #define _IO_NO_WRITES 8 2025-05-07T20:26:59.2974968Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:59.2975317Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:59.2975664Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:59.2975961Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:59.2976388Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:59.2976680Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:59.2976936Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:59.2977203Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:59.2977509Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:59.2977882Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:59.2978239Z #define __cpp_noexcept_function_type 201510L 
2025-05-07T20:26:59.2978542Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:59.2978849Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:59.2979165Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:59.2979555Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:59.2979853Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:59.2980119Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:59.2980382Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:59.2980966Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:59.2981540Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:59.2981861Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:59.2982181Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:59.2982527Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:59.2982789Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:59.2983051Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:59.2983353Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:59.2983669Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:59.2983970Z #define RAND_MAX 2147483647 2025-05-07T20:26:59.2984235Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:59.2984549Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.2984858Z #define __SM_90_RT_H__ 2025-05-07T20:26:59.2985098Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:59.2985348Z #define __COMPAR_FN_T 2025-05-07T20:26:59.2985592Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.2985855Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:59.2986313Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:59.2986813Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.2987150Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.2987500Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:59.2987785Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:59.2988117Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:59.2988425Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:59.2988921Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:59.2989592Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:59.2989917Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:59.2990180Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:59.2990475Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:59.2990775Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:59.2991040Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:59.2991299Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:59.2991560Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:59.2991809Z #define __u_char_defined 2025-05-07T20:26:59.2992115Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:59.2992472Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:59.2992727Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:59.2992971Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:59.2993247Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:59.2993681Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:59.2994089Z #define FP_INFINITE 1 2025-05-07T20:26:59.2994453Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.2994864Z #define 
_IO_pid_t __pid_t 2025-05-07T20:26:59.2995116Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:59.2995528Z #define __LEAF , __leaf__ 2025-05-07T20:26:59.2995771Z #define PATH_MAX 4096 2025-05-07T20:26:59.2996018Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:59.2996342Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:59.2996657Z #define _LIMITS_H___ 2025-05-07T20:26:59.2996879Z #define __size_t 2025-05-07T20:26:59.2997099Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:59.2997626Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:59.2998178Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:59.2998557Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:59.2998880Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:59.2999138Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:59.2999489Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:59.2999872Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:59.3000169Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:59.3000489Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:59.3000762Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:59.3001037Z #define __INT8_C(c) c 2025-05-07T20:26:59.3001296Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:59.3001587Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:59.3001846Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:59.3002101Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:59.3002340Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:59.3002613Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:59.3002928Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3003249Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:59.3003519Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:59.3003788Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:59.3004049Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:59.3004352Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:59.3004649Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:59.3005009Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:59.3005373Z #define NFDBITS __NFDBITS 2025-05-07T20:26:59.3005629Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:59.3005914Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:59.3006220Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:59.3006533Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:59.3006787Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:59.3007066Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:59.3007366Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:59.3007680Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:59.3008094Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:59.3008444Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:59.3008728Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:59.3009038Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:59.3009357Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:59.3009673Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:59.3010004Z #define __daddr_t_defined 2025-05-07T20:26:59.3010248Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:59.3010522Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:59.3010834Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 
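SHRT_MIN above, and SCHAR_MIN, LONG_MIN, and LONG_LONG_MIN later in the dump, are all spelled (-MAX - 1) rather than written as a literal. The reason is that C and C++ have no negative integer literals: -9223372036854775808 is unary minus applied to a positive literal that no signed type can hold, while -LONG_MAX - 1L stays in range throughout. A short illustration (values assume the LP64 model this log reflects):

    #include <climits>
    #include <cstdio>

    int main() {
        // LONG_MIN expands to (-LONG_MAX - 1L); the plain literal would
        // first have to represent +9223372036854775808, which overflows.
        std::printf("%ld\n", LONG_MIN);           // -9223372036854775808
        std::printf("%ld\n", -LONG_MAX - 1L);     // same value, computed safely
        // The unsigned limits in this dump use the matching trick,
        // e.g. ULONG_MAX as (LONG_MAX * 2UL + 1UL) in unsigned arithmetic.
        std::printf("%lu\n", LONG_MAX * 2UL + 1UL); // 18446744073709551615
    }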
2025-05-07T20:26:59.3011353Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:59.3011894Z #define _ACRTIMP 2025-05-07T20:26:59.3012170Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:59.3012499Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:59.3012853Z #define _IOS_BIN 128 2025-05-07T20:26:59.3013280Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:59.3013776Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3014105Z #define UNDERFLOW 4 2025-05-07T20:26:59.3014375Z #define NAME_MAX 255 2025-05-07T20:26:59.3014703Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:59.3014971Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:59.3015246Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:59.3015527Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:59.3015900Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:59.3016283Z #define __ptr_t void * 2025-05-07T20:26:59.3016520Z #define M_E 2.7182818284590452354 2025-05-07T20:26:59.3016787Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:59.3017054Z #define __USE_ISOCXX11 1 2025-05-07T20:26:59.3017320Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:59.3017711Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:59.3018002Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:59.3018275Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:59.3018550Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:59.3018861Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:59.3019115Z #define __linux 1 2025-05-07T20:26:59.3019342Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:59.3019617Z #define cudaDeviceMask 0xff 2025-05-07T20:26:59.3019883Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:59.3020165Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:59.3020445Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:59.3020725Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:59.3021026Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:59.3021319Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:59.3021609Z #define _BITS_TYPES_H 1 2025-05-07T20:26:59.3021897Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:59.3022312Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:59.3022680Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:59.3023016Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:59.3023360Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:59.3023712Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:59.3024482Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:59.3025282Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:59.3025558Z #define __unix 1 2025-05-07T20:26:59.3025773Z #define MATH_ERRNO 1 2025-05-07T20:26:59.3026010Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:59.3026278Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:59.3026533Z #define __SM_100_RT_H__ 2025-05-07T20:26:59.3026782Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:59.3027058Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:59.3027350Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3027626Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:59.3027918Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:59.3029023Z 
#define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:59.3029548Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:59.3029845Z #define CUDARTAPI_CDECL 2025-05-07T20:26:59.3030095Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:59.3030368Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:59.3030655Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:59.3030911Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:59.3031146Z #define __SIZE_T 2025-05-07T20:26:59.3031395Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:59.3031703Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:59.3031996Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:59.3032253Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:59.3032515Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:59.3032778Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:59.3033158Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:59.3033572Z #define __WAIT_STATUS void * 2025-05-07T20:26:59.3033833Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:59.3034099Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:59.3034692Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3034974Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:59.3035247Z #define __WINT_MIN__ 0U 2025-05-07T20:26:59.3035811Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:59.3036436Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:59.3036734Z #define WUNTRACED 2 2025-05-07T20:26:59.3036961Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:59.3037228Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:59.3039197Z #define NZERO 20 2025-05-07T20:26:59.3039426Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:59.3039698Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:59.3039990Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:59.3040278Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:59.3040531Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3040813Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:59.3041085Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:59.3041358Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:59.3041621Z #define EXIT_FAILURE 1 2025-05-07T20:26:59.3041862Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:59.3042120Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:59.3042377Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:59.3042628Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:59.3042904Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:59.3043232Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:59.3044324Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 
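__CUDART_API_VERSION at the start of this block packs the runtime API version as major * 1000 + minor * 10. __CUDA_API_VER_MAJOR__ is 12 in this dump; __CUDA_API_VER_MINOR__ is not visible in this excerpt, so 8 is assumed below from the nearby __CUDACC_VER_MINOR__ 8. A small sketch of the encode/decode arithmetic:

    #include <cstdio>

    int main() {
        const int major = 12, minor = 8;            // 12 from this dump; 8 assumed
        const int api = major * 1000 + minor * 10;  // __CUDART_API_VERSION packing
        std::printf("packed: %d\n", api);           // 12080
        // Decoding reverses the packing.
        std::printf("major=%d minor=%d\n", api / 1000, (api % 1000) / 10);
    }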
2025-05-07T20:26:59.3045016Z 2025-05-07T20:26:59.3045132Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:59.3045427Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:59.3045674Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:59.3045951Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:59.3046242Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:59.3046541Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:59.3046835Z #define SEEK_DATA 3 2025-05-07T20:26:59.3047067Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:59.3047352Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:59.3047765Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:59.3048151Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:59.3048403Z #define __INT64_C(c) c ## L 2025-05-07T20:26:59.3048667Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:59.3048999Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:59.3049320Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:59.3049590Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:59.3049886Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:59.3050182Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:59.3050434Z #define __INT_WCHAR_T_H 2025-05-07T20:26:59.3050669Z #define WSTOPPED 2 2025-05-07T20:26:59.3050905Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:59.3051181Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:59.3051430Z #define FP_NORMAL 4 2025-05-07T20:26:59.3051672Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:59.3051969Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:59.3052262Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:59.3052578Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:59.3052926Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:59.3053253Z #define cudaTextureType1D 0x01 2025-05-07T20:26:59.3053586Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:59.3053917Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:59.3054240Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:59.3054558Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:59.3054979Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:59.3055414Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:59.3055779Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:59.3056044Z #define _POSIX_SOURCE 1 2025-05-07T20:26:59.3056285Z #define cudaTextureType2D 0x02 2025-05-07T20:26:59.3056545Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:59.3056817Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:59.3057122Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:59.3057386Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:59.3057705Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:59.3058038Z #define cudaTextureType3D 0x03 2025-05-07T20:26:59.3058300Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:59.3058647Z #define CLOCK_REALTIME 0 2025-05-07T20:26:59.3058893Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:59.3059158Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:59.3059457Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:59.3059732Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:59.3060000Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:59.3060295Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:59.3060566Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:59.3060862Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:59.3061157Z #define _GLIBCXX_USE_C99_FENV_TR1 1 
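__isascii and __toascii just above are pure bit operations: ASCII is exactly the 7-bit range, so testing or clearing everything above bit 6 suffices. A sketch of the two masks:

    #include <cstdio>

    int main() {
        // __isascii(c): (((c) & ~0x7f) == 0) -- no bits set above the low seven.
        // __toascii(c): ((c) & 0x7f)         -- force the value into 0..127.
        int c = 0xC9;                               // 'É' in Latin-1, not ASCII
        std::printf("%d\n", (c & ~0x7f) == 0);      // 0: not ASCII
        std::printf("0x%02x\n", c & 0x7f);          // 0x49 ('I'): masked to 7 bits
        std::printf("%d\n", ('A' & ~0x7f) == 0);    // 1: 'A' is ASCII
    }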
2025-05-07T20:26:59.3061434Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:59.3061695Z #define __GLIBC__ 2 2025-05-07T20:26:59.3061968Z #define __END_DECLS } 2025-05-07T20:26:59.3062264Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:59.3062713Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:59.3063170Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:59.3063487Z #define WCONTINUED 8 2025-05-07T20:26:59.3063748Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:59.3063998Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:59.3064267Z #define _ALLOCA_H 1 2025-05-07T20:26:59.3064497Z #define __host__ __location__(host) 2025-05-07T20:26:59.3064905Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:59.3065340Z #define __SLONG32_TYPE int 2025-05-07T20:26:59.3065602Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:59.3065882Z #define _SYS_SELECT_H 1 2025-05-07T20:26:59.3066117Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:59.3066360Z #define _IOS_NOCREATE 32 2025-05-07T20:26:59.3066609Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:59.3066878Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:59.3067166Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:59.3067450Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:59.3067724Z #define __global__ __location__(global) 2025-05-07T20:26:59.3068012Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:59.3068267Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:59.3068532Z #define __DBL_DIG__ 15 2025-05-07T20:26:59.3068758Z #define TIME_UTC 1 2025-05-07T20:26:59.3068977Z #define __FLT32_DIG__ 6 2025-05-07T20:26:59.3069397Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:59.3069796Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:59.3070112Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:59.3070412Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:59.3070706Z #define _G_BUFSIZ 8192 2025-05-07T20:26:59.3071008Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:59.3071367Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:59.3071658Z #define __cudaCDP2GetDevice 2025-05-07T20:26:59.3071933Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:59.3072219Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:59.3072465Z #define __GXX_WEAK__ 1 2025-05-07T20:26:59.3072741Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3073043Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:59.3073301Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:59.3073594Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:59.3073929Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:59.3074200Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:59.3074664Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:59.3074965Z #define _G_config_h 1 2025-05-07T20:26:59.3075233Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:59.3075565Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:59.3075841Z #define _GCC_WCHAR_T 2025-05-07T20:26:59.3076075Z #define TMP_MAX 238328 2025-05-07T20:26:59.3076307Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:59.3076576Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:59.3076833Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3134301Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:59.3134663Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:59.3135378Z #define _IO_SKIPWS 01 
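__CONCAT(x,y) x ## y above is the preprocessor's token-pasting operator; the companion stringizing form appears later in the dump as __STRING(x) #x and inside _PSTL_PRAGMA(x) _Pragma(#x). A minimal sketch of both operators (CONCAT and STRING are illustrative local names):

    #include <cstdio>

    #define CONCAT(x, y) x ## y   // paste two tokens into one identifier
    #define STRING(x)    #x       // turn a token sequence into a string literal

    int main() {
        int CONCAT(counter_, 1) = 42;              // declares `counter_1`
        std::printf("%d\n", counter_1);            // 42
        std::printf("%s\n", STRING(hello world));  // "hello world"
    }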
2025-05-07T20:26:59.3135769Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:59.3136218Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:59.3136478Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:59.3136804Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:59.3137156Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:59.3137510Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:59.3137851Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:59.3138096Z #define le32toh(x) (x) 2025-05-07T20:26:59.3138318Z #define _SIZE_T_DEFINED 2025-05-07T20:26:59.3138556Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:59.3138877Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:59.3139224Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:59.3139621Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:59.3140034Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:59.3140305Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:59.3140571Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:59.3140832Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:59.3141118Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:59.3141685Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:59.3142299Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:59.3142670Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:59.3143094Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:59.3143481Z #define _WCHAR_T_ 2025-05-07T20:26:59.3143751Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:59.3144195Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:59.3144594Z #define RTSIG_MAX 32 2025-05-07T20:26:59.3144813Z #define _STDDEF_H 2025-05-07T20:26:59.3145041Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:59.3145314Z #define _VA_LIST_DEFINED 2025-05-07T20:26:59.3145562Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:59.3145882Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:59.3146264Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:59.3146590Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:59.3146879Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:59.3147334Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:59.3147852Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:59.3148208Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:59.3148520Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:59.3148830Z #define __unix__ 1 2025-05-07T20:26:59.3149135Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3149409Z #define __INT_WIDTH__ 32 2025-05-07T20:26:59.3149650Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:59.3149885Z #define _IONBF 2 2025-05-07T20:26:59.3150318Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:59.3151067Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:59.3151800Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:59.3152110Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:59.3152436Z #define __UINT16_C(c) c 2025-05-07T20:26:59.3152728Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:59.3153054Z #define STA_DEL 0x0020 2025-05-07T20:26:59.3153347Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:26:59.3153657Z #define __id_t_defined 2025-05-07T20:26:59.3153957Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:59.3154398Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:59.3154818Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:59.3155162Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:59.3155413Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:59.3155664Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:59.3155925Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:59.3156180Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:59.3156440Z #define SING 2 2025-05-07T20:26:59.3156663Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:59.3156922Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3157219Z #define cudaStreamDefault 0x00 2025-05-07T20:26:59.3157563Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:59.3157932Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:59.3158192Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:59.3158457Z #define __gnu_linux__ 1 2025-05-07T20:26:59.3158692Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:59.3158939Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:59.3159232Z #define MAX_INPUT 255 2025-05-07T20:26:59.3159474Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:59.3159798Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:59.3160167Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:59.3160479Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:59.3160737Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:59.3161135Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:59.3161555Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:59.3161886Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:59.3162234Z #define _Mfloat_ float 2025-05-07T20:26:59.3162497Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:59.3162809Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3163089Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:59.3163407Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:26:59.3163934Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:59.3164414Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3164688Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:59.3165009Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:59.3165363Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:59.3165654Z #define __USE_ISOC11 1 2025-05-07T20:26:59.3165881Z #define _BSD_SIZE_T_ 2025-05-07T20:26:59.3166114Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:59.3166353Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:59.3166610Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:59.3166904Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:59.3167212Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:59.3167514Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:59.3167841Z #define __THROW throw () 2025-05-07T20:26:59.3168086Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:59.3168375Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3168724Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.3169067Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:59.3169339Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:59.3169598Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:59.3169859Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:59.3170110Z #define L_tmpnam 20 2025-05-07T20:26:59.3170335Z #define ___int_wchar_t_h 2025-05-07T20:26:59.3170772Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:59.3171146Z #define isascii(c) __isascii (c) 2025-05-07T20:26:59.3171401Z #define _T_PTRDIFF 2025-05-07T20:26:59.3171705Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:59.3172053Z #define toascii(c) __toascii (c) 2025-05-07T20:26:59.3172333Z #define __GNUC__ 11 2025-05-07T20:26:59.3172605Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:59.3172896Z #define __GXX_RTTI 1 2025-05-07T20:26:59.3173115Z #define __pie__ 2 2025-05-07T20:26:59.3173322Z #define __MMX__ 1 2025-05-07T20:26:59.3173623Z #define __cudaCDP2Malloc 2025-05-07T20:26:59.3173874Z #define __timespec_defined 1 2025-05-07T20:26:59.3174122Z #define L_ctermid 9 2025-05-07T20:26:59.3174348Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:59.3174654Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:59.3175046Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:59.3175420Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:59.3175681Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:59.3175968Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:59.3176269Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:59.3176571Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:59.3176833Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:59.3177264Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:59.3177986Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:59.3178584Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:59.3178886Z #define __USE_SVID 1 2025-05-07T20:26:59.3179139Z #define __constant__ __location__(constant) 2025-05-07T20:26:59.3179445Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:59.3179738Z #define __device__ __location__(device) 2025-05-07T20:26:59.3180067Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:59.3180380Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:59.3180640Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:59.3180915Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:59.3181250Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:59.3181613Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:59.3181945Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:59.3182388Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:59.3182855Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:59.3183158Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:59.3183607Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:59.3184067Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:59.3184376Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:59.3184643Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:59.3184899Z #define NGROUPS_MAX 65536 2025-05-07T20:26:59.3185157Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:59.3185418Z #define __USE_ISOC95 1 2025-05-07T20:26:59.3185635Z #define _TIME_H 1 2025-05-07T20:26:59.3185897Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:59.3186212Z #define __USE_ISOC99 1 2025-05-07T20:26:59.3186523Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:59.3186884Z #define HOST_NAME_MAX 64 2025-05-07T20:26:59.3187133Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:59.3187389Z #define _IOS_ATEND 4 2025-05-07T20:26:59.3187616Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:59.3187940Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:59.3188336Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:59.3188668Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:59.3188947Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:59.3189360Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:59.3189775Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:59.3190033Z #define _STDIO_H 1 2025-05-07T20:26:59.3190435Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:59.3190888Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:59.3191250Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:59.3191624Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:59.3191914Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:59.3192172Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:59.3192440Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:59.3192818Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:59.3193110Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3193420Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:59.3193694Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:59.3193962Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:59.3194272Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:59.3194544Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:59.3194821Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:59.3195174Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:59.3195555Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:59.3195793Z #define __USE_XOPEN 1 2025-05-07T20:26:59.3196035Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:59.3196467Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:59.3196899Z #define __USE_XOPEN2K 1 2025-05-07T20:26:59.3197142Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:59.3197406Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:59.3197697Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:59.3197969Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:59.3198487Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:59.3199000Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:59.3199286Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:59.3199643Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:59.3200025Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3200396Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:59.3200786Z #define __END_NAMESPACE_C99 2025-05-07T20:26:59.3201059Z #define __glibcxx_integral_traps true 2025-05-07T20:26:59.3201336Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:59.3201595Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:59.3201880Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:59.3202197Z #define _IOS_TRUNC 16 2025-05-07T20:26:59.3202487Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:59.3202796Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:59.3203142Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:59.3203509Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:59.3203953Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:59.3204391Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:59.3204659Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:59.3204918Z #define _IO_UNITBUF 020000 2025-05-07T20:26:59.3205167Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:59.3205417Z #define __FD_SETSIZE 1024 2025-05-07T20:26:59.3205664Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:59.3205931Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:59.3206263Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:59.3206613Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:59.3206875Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:59.3207174Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:59.3207498Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:59.3207765Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:59.3208056Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:59.3208386Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:59.3208670Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:59.3209132Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:59.3209414Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:59.3209682Z #define __USE_POSIX199506 1 2025-05-07T20:26:59.3209933Z #define _FEATURES_H 1 2025-05-07T20:26:59.3210164Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:59.3210547Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:59.3211016Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:59.3211337Z #define 
__stub_getmsg 2025-05-07T20:26:59.3211597Z #define _IO_FIXED 010000 2025-05-07T20:26:59.3211935Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:26:59.3212418Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:59.3212752Z #define __stub_setlogin 2025-05-07T20:26:59.3213049Z #define __stub_fattach 2025-05-07T20:26:59.3213341Z #define __cplusplus 201703L 2025-05-07T20:26:59.3213663Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:59.3213960Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:59.3214220Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:59.3214488Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:59.3214970Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:59.3215489Z #define _IO_INTERNAL 010 2025-05-07T20:26:59.3215726Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:59.3216057Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:59.3216405Z #define __dev_t_defined 2025-05-07T20:26:59.3216634Z #define __DEPRECATED 1 2025-05-07T20:26:59.3216862Z #define __S32_TYPE int 2025-05-07T20:26:59.3217113Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:59.3217405Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:59.3217663Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:59.3217914Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:59.3218507Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:59.3219127Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:59.3219439Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:59.3219776Z #define OVERFLOW 3 2025-05-07T20:26:59.3220015Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:59.3220320Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:59.3220606Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3220934Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:59.3221260Z #define __SSE2_MATH__ 1 2025-05-07T20:26:59.3221502Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:59.3221800Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3222096Z #define _IO_STDIO_H 2025-05-07T20:26:59.3222343Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:59.3222627Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:59.3222932Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:59.3223223Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3223530Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:59.3223789Z #define __amd64 1 2025-05-07T20:26:59.3224010Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:59.3224273Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:59.3224538Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:59.3224823Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:59.3225121Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:59.3225374Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:59.3225669Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:59.3225929Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:59.3226169Z #define __bounded 2025-05-07T20:26:59.3226393Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:59.3226665Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3226949Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:59.3227220Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:59.3227484Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:59.3227756Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3228069Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:26:59.3229299Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:59.3229810Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:59.3230074Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:59.3230410Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:59.3230751Z #define STA_PLL 0x0001 2025-05-07T20:26:59.3230990Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:59.3231254Z #define __GNUG__ 11 2025-05-07T20:26:59.3231486Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:59.3231753Z #define _T_WCHAR 2025-05-07T20:26:59.3232044Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:59.3232521Z #define __specialization_static 2025-05-07T20:26:59.3232893Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:59.3233267Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:59.3233584Z #define cudaArraySparse 0x40 2025-05-07T20:26:59.3233905Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:59.3234238Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:59.3234538Z #define _WCHAR_T 2025-05-07T20:26:59.3234755Z #define __cudaCDP2Free 2025-05-07T20:26:59.3235374Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:59.3236055Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:59.3236466Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:59.3236898Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3237163Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:59.3237426Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:59.3237753Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:59.3238090Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:59.3238332Z #define __NO_CTYPE 1 2025-05-07T20:26:59.3238557Z #define __stub_bdflush 2025-05-07T20:26:59.3238920Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:59.3239334Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:59.3239632Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:59.3239886Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:59.3240159Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:59.3240460Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:59.3240749Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:59.3241070Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:59.3241416Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:59.3241698Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:59.3241976Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:59.3242319Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:59.3242676Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:59.3242983Z #define _IO_STDIO 040000 2025-05-07T20:26:59.3243307Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:59.3243693Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:59.3244001Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:59.3244289Z #define _PTRDIFF_T 2025-05-07T20:26:59.3244504Z #define _MOVE_H 1 2025-05-07T20:26:59.3244730Z #define __cpp_hex_float 201603L 2025-05-07T20:26:59.3244982Z #define ADJ_TAI 0x0080 2025-05-07T20:26:59.3245208Z #define __ptrvalue 2025-05-07T20:26:59.3245429Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:59.3245674Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:26:59.3245958Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:59.3246254Z #define MATH_ERREXCEPT 2 2025-05-07T20:26:59.3246504Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:59.3246782Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:59.3247176Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:59.3247546Z #define __USE_GNU 1 2025-05-07T20:26:59.3247778Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:59.3248154Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:59.3248414Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:59.3248791Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:59.3249169Z #define WEXITED 4 2025-05-07T20:26:59.3249388Z #define _IO_NO_READS 4 2025-05-07T20:26:59.3249678Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:59.3250019Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:59.3250294Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:59.3250579Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:59.3250889Z #define __uid_t_defined 2025-05-07T20:26:59.3251216Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:59.3251495Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:59.3251767Z #define WNOHANG 1 2025-05-07T20:26:59.3252010Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:59.3252305Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:59.3252573Z #define cudaEventDefault 0x00 2025-05-07T20:26:59.3252873Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:59.3253191Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:59.3253418Z #define __x86_64 1 2025-05-07T20:26:59.3253646Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:59.3254034Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3254497Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:59.3254986Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:59.3255414Z #define __PTRDIFF_T 2025-05-07T20:26:59.3255725Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:59.3256100Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:59.3265556Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3265900Z #define _Mlong_double_ long double 2025-05-07T20:26:59.3266192Z #define __cpp_lambdas 200907L 2025-05-07T20:26:59.3266449Z #define _IO_DEC 020 2025-05-07T20:26:59.3266684Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:59.3266963Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:59.3267253Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:59.3267531Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:59.3267799Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:59.3268098Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:59.3268420Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:59.3268697Z #define _ANSI_STDDEF_H 2025-05-07T20:26:59.3268979Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:59.3269351Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:59.3269718Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:59.3270113Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:59.3270390Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:59.3270685Z #define __cpp_template_auto 201606L 2025-05-07T20:26:59.3271047Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:59.3271420Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:59.3271740Z #define __key_t_defined 2025-05-07T20:26:59.3272041Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:59.3272494Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:59.3273075Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:59.3273519Z #define __GNUC_VA_LIST 2025-05-07T20:26:59.3273909Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:59.3274294Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:59.3274553Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:59.3274839Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:59.3275134Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:59.3275383Z #define __WCOREFLAG 0x80 2025-05-07T20:26:59.3275632Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:59.3275937Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:59.3276213Z #define __LP64__ 1 2025-05-07T20:26:59.3276647Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:59.3276970Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:59.3277254Z #define _IO_off64_t __off64_t 2025-05-07T20:26:59.3277508Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3277770Z #define __time_t_defined 1 2025-05-07T20:26:59.3278024Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:59.3278363Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:59.3278728Z #define __USE_UNIX98 1 2025-05-07T20:26:59.3278972Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3279235Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:59.3279598Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:59.3279894Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:59.3280202Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:59.3280455Z #define SEEK_CUR 1 2025-05-07T20:26:59.3280681Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3280950Z #define _ASSERT_H 1 2025-05-07T20:26:59.3281518Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:59.3282144Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:59.3282416Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:59.3282664Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:59.3282928Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:59.3283198Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:59.3283562Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:59.3283967Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:59.3284625Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:59.3285279Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:59.3285572Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:59.3285930Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:59.3286307Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:59.3286570Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.3286852Z #define cudaArrayDefault 0x00 2025-05-07T20:26:59.3287130Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:59.3287416Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:59.3287700Z #define TLOSS 5 2025-05-07T20:26:59.3287920Z #define __ssize_t_defined 2025-05-07T20:26:59.3288173Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:26:59.3288439Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:59.3288735Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:59.3289015Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:59.3289291Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:59.3289577Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:59.3289885Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:59.3290172Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:59.3290463Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:59.3290755Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:59.3291008Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:59.3291338Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:59.3291705Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:59.3291992Z #define __cdecl 2025-05-07T20:26:59.3292292Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:59.3292699Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:59.3293104Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:59.3293410Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:59.3293747Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:59.3294112Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:59.3294421Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:59.3294727Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:59.3295054Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:59.3295451Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:59.3295983Z #define ADJ_NANO 0x2000 2025-05-07T20:26:59.3296289Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:59.3296643Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:59.3296922Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:59.3297183Z #define __FLT_DIG__ 6 2025-05-07T20:26:59.3297534Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:59.3297926Z #define __NO_INLINE__ 1 2025-05-07T20:26:59.3298228Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:59.3298578Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:59.3298911Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:59.3299176Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:59.3299465Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:59.3299729Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:59.3300030Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:59.3300317Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:59.3300700Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:26:59.3301115Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:59.3301459Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:59.3301864Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:59.3302159Z #define MAX_CANON 255 2025-05-07T20:26:59.3302448Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:59.3302767Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:59.3303093Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:59.3303445Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:59.3303826Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:59.3304128Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:59.3304406Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:59.3304729Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:59.3305037Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:59.3305301Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:59.3305601Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:59.3305891Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:59.3306167Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:59.3306478Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:59.3306773Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:59.3307028Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:59.3307282Z #define _SYS_TYPES_H 1 2025-05-07T20:26:59.3307525Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:59.3307780Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:59.3308032Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:59.3308270Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:59.3308547Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:59.3308841Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:59.3309191Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:59.3309485Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:59.3309754Z #define FP_SUBNORMAL 3 2025-05-07T20:26:59.3310007Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:59.3310287Z #define _INITIALIZER_LIST 2025-05-07T20:26:59.3310537Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:59.3310794Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:59.3311083Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:59.3311334Z #define _IO_file_flags _flags 2025-05-07T20:26:59.3311589Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:59.3311837Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:59.3312107Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:59.3312381Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:59.3312647Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:59.3313015Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:59.3313414Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:59.3313723Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:59.3313988Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:59.3314244Z #define _BSD_SOURCE 1 2025-05-07T20:26:59.3314479Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:59.3315442Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template> struct __has_ ##_NTYPE : false_type { }; template struct __has_ ##_NTYPE<_Tp, __void_t> : true_type { }; 2025-05-07T20:26:59.3316297Z #define __catch(X) catch(X) 2025-05-07T20:26:59.3316558Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:59.3316848Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:59.3317116Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:59.3317367Z #define __STRING(x) #x 2025-05-07T20:26:59.3317609Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:59.3317876Z #define _T_PTRDIFF_ 2025-05-07T20:26:59.3318214Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:59.3318520Z 
#define cudaEventWaitExternal 0x01 2025-05-07T20:26:59.3318791Z #define __unbounded 2025-05-07T20:26:59.3319033Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3319323Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:59.3319603Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3319903Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:59.3320181Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:59.3320480Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:59.3320806Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:59.3321118Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:59.3321398Z #define __managed__ __location__(managed) 2025-05-07T20:26:59.3321694Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:59.3322168Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:59.3322692Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:59.3323013Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:59.3323474Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:59.3323970Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:59.3324282Z #define _SYS_SIZE_T_H 2025-05-07T20:26:59.3324569Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:59.3324910Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:59.3325190Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:59.3325480Z #define _CRTIMP 2025-05-07T20:26:59.3325705Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:59.3326018Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:59.3326350Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:59.3326704Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:59.3327120Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3327430Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:59.3327710Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:59.3328002Z #define __SIZE_T__ 2025-05-07T20:26:59.3328557Z #define __stub_gtty 2025-05-07T20:26:59.3328861Z #define __pid_t_defined 2025-05-07T20:26:59.3329133Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:59.3329439Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3329757Z #define __glibcxx_function_requires(...) 
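The be16toh/htobe16 family scattered through this dump (be64toh, htobe32, be32toh, le32toh, htole32) converts between big-endian and host byte order. On this little-endian machine the le* forms are the identity (le32toh(x) (x)) while the be* forms byte-swap, and __bswap_constant_16 near the end of the dump shows the 16-bit swap itself. A sketch of that swap (bswap16 is an illustrative helper mirroring __bswap_constant_16):

    #include <cstdint>
    #include <cstdio>

    // Same shape as __bswap_constant_16: exchange the two bytes.
    constexpr std::uint16_t bswap16(std::uint16_t x) {
        return static_cast<std::uint16_t>(((x >> 8) & 0xff) | ((x & 0xff) << 8));
    }

    int main() {
        std::printf("0x%04x\n", (unsigned)bswap16(0x1234)); // 0x3412
        // On a little-endian host, be16toh(x) expands to exactly this swap;
        // on a big-endian host it would expand to the identity instead.
    }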
2025-05-07T20:26:59.3330064Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:59.3330305Z #define __need_clockid_t 2025-05-07T20:26:59.3330551Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:59.3330808Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:59.3331121Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:59.3331455Z #define _IO_HEX 0100 2025-05-07T20:26:59.3331774Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:59.3332188Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:59.3332314Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:59.3332442Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:59.3332728Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.3332876Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:59.3333007Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:59.3333139Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:59.3333271Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:59.3333657Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:59.3333754Z #define __stub_sstk 2025-05-07T20:26:59.3333846Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:59.3334001Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:59.3334089Z #define __wur 2025-05-07T20:26:59.3334206Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:59.3334299Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:59.3334381Z #define _IO_OCT 040 2025-05-07T20:26:59.3334476Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:59.3334571Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:59.3334662Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:59.3334913Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:59.3335011Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:59.3335114Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:59.3335302Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:59.3335404Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:59.3335502Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:59.3335610Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:59.3335707Z #define __off64_t_defined 2025-05-07T20:26:59.3335806Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:59.3335899Z #define __FLT128_DIG__ 33 2025-05-07T20:26:59.3336005Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:59.3336104Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:59.3336197Z #define __INT32_C(c) c 2025-05-07T20:26:59.3336292Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:59.3336391Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:59.3336491Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:59.3336589Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:59.3336676Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:59.3336780Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:59.3336910Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:59.3337004Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:59.3337104Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:59.3337205Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:59.3337304Z #define __have_pthread_attr_t 1 2025-05-07T20:26:59.3337403Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:59.3337623Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:59.3337736Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:59.3337838Z #define __cudaCDP2EventRecord 2025-05-07T20:26:59.3337932Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:59.3338024Z #define 
htole32(x) (x) 2025-05-07T20:26:59.3338271Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:59.3338397Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:59.3338506Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:59.3338662Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:59.3338804Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:59.3338928Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:59.3339072Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:59.3339170Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:59.3339270Z #define cudaArrayLayered 0x01 2025-05-07T20:26:59.3339438Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:59.3339558Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:59.3339654Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:59.3339755Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:59.3339842Z #define unix 1 2025-05-07T20:26:59.3339935Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:59.3340036Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:59.3340135Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:59.3340251Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:59.3340344Z #define __USE_POSIX 1 2025-05-07T20:26:59.3340440Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:59.3340572Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:59.3340671Z #define __THROWNL throw () 2025-05-07T20:26:59.3340893Z #define __cpp_rtti 199711L 2025-05-07T20:26:59.3340997Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:59.3341093Z #define __PMT(args) args 2025-05-07T20:26:59.3341210Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3341357Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:59.3341478Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:59.3341569Z #define _SIZE_T_DECLARED 2025-05-07T20:26:59.3341671Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:59.3341765Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:59.3342156Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:59.3342346Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:59.3342441Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:59.3342535Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:59.3342682Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:59.3342767Z #define _WCHAR_T_H 2025-05-07T20:26:59.3342865Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:59.3342964Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:59.3343051Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:59.3343156Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:59.3343249Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:59.3343338Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:59.3343451Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:59.3343534Z #define __ELF__ 1 2025-05-07T20:26:59.3343634Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:59.3343741Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:59.3343827Z #define STA_INS 0x0010 2025-05-07T20:26:59.3343931Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:59.3344106Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:59.3344203Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:59.3344302Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:59.3344418Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:59.3344532Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3344637Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:59.3344740Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:59.3344840Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:59.3344998Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:59.3345156Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:59.3345254Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:59.3345579Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:59.3345706Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:59.3345807Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:59.3345904Z #define __FLT_RADIX__ 2 2025-05-07T20:26:59.3346007Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:59.3346179Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:59.3346274Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:59.3346373Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:59.3346481Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:59.3346577Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:59.3346678Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:59.3346786Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:59.3346870Z #define WORD_BIT 32 2025-05-07T20:26:59.3346956Z #define _IO_USER_BUF 1 2025-05-07T20:26:59.3347054Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:59.3347158Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3347268Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:59.3347374Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:59.3347476Z #define __long_double_t long double 2025-05-07T20:26:59.3347577Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:59.3347669Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:59.3348069Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:59.3348158Z #define __k8 1 2025-05-07T20:26:59.3348443Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:59.3348616Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:59.3348740Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:59.3348841Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:59.3348940Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:59.3349118Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:59.3349213Z #define __blksize_t_defined 2025-05-07T20:26:59.3349312Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:59.3349410Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:59.3349608Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:59.3349710Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:59.3349815Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:59.3349912Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:59.3350017Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:59.3350276Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:59.3350615Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:59.3350723Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:59.3350822Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:59.3350911Z #define SEEK_SET 0 2025-05-07T20:26:59.3351010Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:59.3351106Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:26:59.3351303Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:59.3351408Z #define __cudaCDP2GetLastError 2025-05-07T20:26:59.3351514Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:59.3351635Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:59.3352031Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:59.3352154Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:59.3352283Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:59.3352406Z #define __stub_sigreturn 2025-05-07T20:26:59.3352708Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:59.3352831Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:59.3352946Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:59.3353078Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:59.3353184Z #define CLOCK_TAI 11 2025-05-07T20:26:59.3353320Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:59.3353585Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:59.3353696Z #define __restrict_arr 2025-05-07T20:26:59.3353841Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:59.3354020Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:59.3354609Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:59.3354805Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:59.3354893Z #define __USE_MISC 1 2025-05-07T20:26:59.3354999Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:59.3355109Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:59.3355200Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:59.3355287Z #define __LDBL_DIG__ 18 2025-05-07T20:26:59.3355394Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:59.3355500Z #define __malloc_and_calloc_defined 2025-05-07T20:26:59.3355600Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:59.3355707Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:59.3355795Z #define __x86_64__ 1 2025-05-07T20:26:59.3355883Z #define _SIZE_T_ 2025-05-07T20:26:59.3356850Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:59.3356956Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:59.3357064Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:59.3357178Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:59.3357303Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:59.3357397Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:59.3357506Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:59.3357631Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:59.3357768Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:59.3357942Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:59.3358411Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:26:59.3358534Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:59.3358691Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:59.3358792Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:59.3358891Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:59.3358984Z #define STA_FLL 0x0008 2025-05-07T20:26:59.3359126Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:59.3359222Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:59.3359349Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3359461Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:59.3359547Z #define __stub_revoke 2025-05-07T20:26:59.3359648Z #define __timer_t_defined 1 2025-05-07T20:26:59.3359785Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:59.3359876Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:59.3359992Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:59.3360097Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:59.3360200Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:59.3360306Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:59.3360415Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:59.3360524Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:59.3360668Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:59.3360763Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:59.3360864Z #define _IO_off_t __off_t 2025-05-07T20:26:59.3360953Z #define __FLT64_DIG__ 15 2025-05-07T20:26:59.3361172Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:59.3361280Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:59.3361409Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3361547Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:59.3361644Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:59.3361745Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:59.3361836Z #define NULL __null 2025-05-07T20:26:59.3361967Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:59.3362077Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:59.3362188Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:59.3362282Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3362375Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:59.3362467Z #define FP_ZERO 2 2025-05-07T20:26:59.3362564Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:59.3362722Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:59.3362856Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3362958Z #define __WCHAR_T__ 2025-05-07T20:26:59.3363066Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:59.3363262Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:59.3363418Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:59.3363518Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:59.3363639Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:59.3363757Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:59.3364006Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:59.3364136Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:59.3364227Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:59.3364325Z #define _SIGSET_H_types 1 2025-05-07T20:26:59.3364441Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:59.3364552Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:59.3364701Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:26:59.3364808Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:59.3364933Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:59.3365068Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:59.3365282Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:59.3365415Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:59.3365527Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:26:59.3365699Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:59.3365808Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:59.3365911Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:59.3366021Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:59.3366110Z #define STA_MODE 0x4000 2025-05-07T20:26:59.3366219Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:59.3366331Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:59.3366448Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:59.3366549Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:59.3366651Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:59.3366758Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:59.3366857Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:59.3366978Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:59.3367068Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:59.3367184Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3367275Z #define __SEG_FS 1 2025-05-07T20:26:59.3367365Z #define _IO_size_t size_t 2025-05-07T20:26:59.3367476Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:59.3367575Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:59.3367661Z #define __stub_lchmod 2025-05-07T20:26:59.3367759Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:59.3367870Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3367967Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:59.3368056Z #define __SEG_GS 1 2025-05-07T20:26:59.3368235Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:59.3368325Z #define _IOS_APPEND 8 2025-05-07T20:26:59.3368428Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:59.3368520Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:59.3368622Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:59.3368727Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:59.3368826Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:59.3368917Z #define htole16(x) (x) 2025-05-07T20:26:59.3369027Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:59.3369121Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:59.3369226Z #define __INT16_TYPE__ short int 2025-05-07T20:26:59.3369328Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:59.3369434Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:59.3369553Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:59.3369677Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:59.3369769Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:59.3369864Z #define __WCLONE 0x80000000 2025-05-07T20:26:59.3369956Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:59.3370039Z #define SEEK_HOLE 4 2025-05-07T20:26:59.3370133Z #define TIMER_ABSTIME 1 2025-05-07T20:26:59.3370234Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:59.3370331Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:59.3370504Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:59.3370616Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3370718Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:59.3370917Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:26:59.3371017Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3371143Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:59.3371233Z #define _LINUX_LIMITS_H 2025-05-07T20:26:59.3371314Z #define linux 1 2025-05-07T20:26:59.3371415Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:59.3371524Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:59.3371628Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:59.3371722Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:59.3371826Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:59.3371975Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:59.3372151Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:59.3372247Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3372350Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:59.3372440Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:59.3372525Z #define htole64(x) (x) 2025-05-07T20:26:59.3372632Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:59.3372762Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:59.3372857Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:59.3373351Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:59.3373439Z #define __USE_POSIX2 1 2025-05-07T20:26:59.3373547Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:59.3373635Z #define __WALL 0x40000000 2025-05-07T20:26:59.3373732Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:59.3373825Z #define _XLOCALE_H 1 2025-05-07T20:26:59.3373920Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:59.3374023Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:59.3374125Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3374228Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:59.3374316Z #define __EXCEPTIONS 1 2025-05-07T20:26:59.3374421Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:59.3374612Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:59.3374711Z #define __WORDSIZE 64 2025-05-07T20:26:59.3374817Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:59.3375020Z #define _STL_RELOPS_H 1 2025-05-07T20:26:59.3375188Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:59.3375320Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:59.3384507Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:59.3384632Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:59.3384735Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:59.3385041Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:59.3385275Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:59.3385430Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:59.3385533Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:59.3385638Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:59.3385759Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:59.3385868Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:59.3385979Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:59.3386168Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:59.3386269Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:59.3386364Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:59.3386477Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:59.3386650Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:59.3386771Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:59.3386859Z #define _STRING_H 1 2025-05-07T20:26:59.3386961Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:59.3387062Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:59.3387163Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:59.3387300Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:59.3387403Z #define __code_model_small__ 1 2025-05-07T20:26:59.3387494Z #define _PSTL_CONFIG_H 2025-05-07T20:26:59.3387597Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:59.3387879Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:59.3387977Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:59.3388089Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:59.3388427Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:59.3388525Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:59.3388622Z #define le64toh(x) (x) 2025-05-07T20:26:59.3388715Z #define FILENAME_MAX 4096 2025-05-07T20:26:59.3388868Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:59.3388994Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:59.3389319Z #define L_cuserid 9 2025-05-07T20:26:59.3389412Z #define __ino_t_defined 2025-05-07T20:26:59.3389503Z #define __k8__ 1 2025-05-07T20:26:59.3389603Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:59.3389713Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:59.3389817Z #define __int8_t_defined 2025-05-07T20:26:59.3389919Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:59.3390028Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3390144Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:59.3390242Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:59.3390367Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:59.3390516Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:59.3390609Z #define __HAVE_COLUMN 2025-05-07T20:26:59.3390696Z #define __stub_fdetach 2025-05-07T20:26:59.3391100Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:59.3391194Z #define __pic__ 2 2025-05-07T20:26:59.3391313Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3391411Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:59.3391534Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:59.3391658Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:59.3391773Z #define __stub_chflags 2025-05-07T20:26:59.3391891Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:59.3391996Z #define __need_IOV_MAX 2025-05-07T20:26:59.3392131Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:59.3392270Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:59.3392393Z #define __cpp_decltype 200707L 2025-05-07T20:26:59.3392520Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:59.3392634Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:59.3392766Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:59.3392882Z #define TTY_NAME_MAX 32 2025-05-07T20:26:59.3393086Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:59.3393242Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3393457Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:59.3393596Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:59.3393713Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:59.3393835Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:59.3393944Z #define __import__ 2025-05-07T20:26:59.3394062Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:59.3394200Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:59.3394287Z #define __export__ 2025-05-07T20:26:59.3394411Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:59.3394512Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:59.3394674Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:59.3394777Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:59.3394867Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:59.3394963Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:59.3395064Z #define _WCHAR_T_DECLARED 2025-05-07T20:26:59.3395182Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:59.3395301Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:59.3395414Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:59.3395504Z #define WNOWAIT 0x01000000 2025-05-07T20:26:59.3395597Z #define PLOSS 6 2025-05-07T20:26:59.3395779Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:59.3396041Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:59.3396135Z #define EXIT_SUCCESS 0 2025-05-07T20:26:59.3396232Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:59.3396326Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:59.3396434Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:59.3396524Z #define __thread__ __thread 2025-05-07T20:26:59.3396620Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:59.3396720Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:59.3396822Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:59.3397128Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:59.3397242Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:59.3397336Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:59.3397428Z #define __linux__ 1 2025-05-07T20:26:59.3397523Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:59.3397653Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:59.3397751Z #define __S16_TYPE short int 2025-05-07T20:26:59.3398092Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:59.3398198Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:59.3398391Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:59.3398490Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:59.3398598Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:59.3398679Z #define _T_SIZE_ 2025-05-07T20:26:59.3398782Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:59.3398906Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:59.3398999Z #define _PSTL_VERSION 12000 2025-05-07T20:26:59.3399120Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:59.3399224Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:59.3399321Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:59.3399453Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:59.3399549Z #define _IOS_INPUT 1 2025-05-07T20:26:59.3399641Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:59.3399744Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:59.3399846Z #define __INT64_TYPE__ long int 2025-05-07T20:26:59.3399944Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:59.3400048Z #define __shared__ __location__(shared) 2025-05-07T20:26:59.3400140Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:59.3400294Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:59.3400391Z #define __gid_t_defined 2025-05-07T20:26:59.3400511Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:59.3400610Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:59.3400813Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:59.3400909Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:59.3401000Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:59.3401097Z #define ___int_size_t_h 2025-05-07T20:26:59.3401202Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3401328Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:26:59.3401484Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:59.3401588Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:59.3401689Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:59.3401787Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:59.3401881Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:59.3402010Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3402124Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:59.3402250Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:59.3402347Z #define __clock_t_defined 1 2025-05-07T20:26:59.3402447Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:59.3402566Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:59.3402656Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:59.3402749Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:59.3402948Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:59.3403058Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:59.3403149Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:59.3403326Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:59.3403410Z #define __SSE__ 1 2025-05-07T20:26:59.3403508Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:59.3403611Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:59.3403696Z #define _CTYPE_H 1 2025-05-07T20:26:59.3403787Z #define __sigset_t_defined 2025-05-07T20:26:59.3403895Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:59.3404070Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:59.3404168Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:59.3404265Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:59.3404361Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:59.3404455Z #define __SM_70_RT_H__ 2025-05-07T20:26:59.3404549Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:59.3404663Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:59.3404765Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:59.3404928Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:59.3405022Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:59.3405142Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:59.3405240Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:59.3405331Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:59.3405419Z #define __amd64__ 1 2025-05-07T20:26:59.3405509Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:59.3405620Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:59.3405883Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3405990Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:59.3406081Z #define EOF (-1) 2025-05-07T20:26:59.3406180Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:59.3406276Z #define __USE_POSIX199309 1 2025-05-07T20:26:59.3406384Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:59.3406486Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:59.3406581Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:59.3406685Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:59.3406800Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:59.3406895Z #define ____mbstate_t_defined 1 2025-05-07T20:26:59.3406990Z #define STA_NANO 0x2000 2025-05-07T20:26:59.3407088Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:59.3407189Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:59.3407278Z #define _IO_LINKED 0x80 2025-05-07T20:26:59.3407374Z #define __cpp_lib_launder 201606 2025-05-07T20:26:59.3407476Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:59.3407584Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:59.3407681Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:59.3407781Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:59.3407921Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:59.3408027Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3408139Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:59.3408237Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:59.3408330Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:59.3408426Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:59.3408557Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:59.3408684Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:59.3408884Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:59.3409066Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:59.3409156Z #define __stub_stty 2025-05-07T20:26:59.3409319Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:59.3409410Z #define le16toh(x) (x) 2025-05-07T20:26:59.3409524Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:59.3409696Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:59.3409785Z #define _SIZET_ 2025-05-07T20:26:59.3409877Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:59.3410051Z #define _SVID_SOURCE 1 2025-05-07T20:26:59.3410140Z #define _LP64 1 2025-05-07T20:26:59.3410231Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:59.3410462Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:59.3410582Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:59.3410667Z #define __UINT8_C(c) c 2025-05-07T20:26:59.3410761Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:59.3410861Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:59.3410971Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:59.3411065Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:59.3411243Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:59.3411340Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:59.3411429Z #define CUDARTAPI 2025-05-07T20:26:59.3411526Z #define IOV_MAX 1024 2025-05-07T20:26:59.3411705Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:59.3411830Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:59.3411953Z #define P_tmpdir "/tmp" 2025-05-07T20:26:59.3412081Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:59.3412191Z #define __wchar_t__ 2025-05-07T20:26:59.3412321Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:59.3412422Z #define SEEK_END 2 2025-05-07T20:26:59.3412543Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:59.3412756Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:59.3412878Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:59.3413065Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:59.3413177Z #define ____FILE_defined 1 2025-05-07T20:26:59.3413328Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:59.3413457Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:59.3413565Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:59.3413689Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:59.3413995Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:59.3414161Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:59.3414276Z #define _IO_RIGHT 04 2025-05-07T20:26:59.3414374Z #define __END_NAMESPACE_STD 2025-05-07T20:26:59.3414558Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:59.3414661Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:59.3414781Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:59.3414883Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:59.3414984Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:59.3415068Z #define _STDDEF_H_ 2025-05-07T20:26:59.3415245Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:59.3415348Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3415467Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:59.3415674Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:59.3415786Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:59.3415932Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:59.3416065Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:59.3416167Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:59.3416284Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:59.3416381Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:59.3416494Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:59.3416598Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:59.3416696Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:59.3416793Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:59.3416973Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:59.3417074Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:59.3417249Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:59.3417355Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:59.3417450Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:59.3417593Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:59.3417816Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:59.3417918Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:59.3418023Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:59.3418142Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:59.3418235Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:59.3418342Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:59.3418505Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:59.3418673Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:59.3418779Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:59.3418900Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:59.3419093Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:59.3419201Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:59.3419426Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:59.3419528Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:59.3419645Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:59.3419739Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:59.3419837Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:59.3419933Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:59.3420029Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:59.3420129Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:59.3420210Z #define __FXSR__ 1 2025-05-07T20:26:59.3420291Z #define _SIZE_T 2025-05-07T20:26:59.3420398Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:59.3420509Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:59.3420680Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:59.3420832Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:59.3420924Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:59.3421028Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:59.3421209Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:59.3421412Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:59.3421511Z #define _GXX_NULLPTR_T 2025-05-07T20:26:59.3421637Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:59.3421734Z #define FOPEN_MAX 16 2025-05-07T20:26:59.3421851Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:59.3421997Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:59.3422123Z #define __suseconds_t_defined 2025-05-07T20:26:59.3422232Z #define __off_t_defined 2025-05-07T20:26:59.3422339Z #define stderr stderr 2025-05-07T20:26:59.3422463Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:59.3422603Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:59.3422729Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:59.3422852Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:59.3423355Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:59.3423468Z #define __mode_t_defined 2025-05-07T20:26:59.3423584Z #define _GCC_SIZE_T 2025-05-07T20:26:59.3423706Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3423823Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:59.3423936Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:59.3424032Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:59.3424129Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:59.3424233Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:59.3424338Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:59.3424449Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:59.3424541Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:59.3424627Z #define __size_t__ 2025-05-07T20:26:59.3424766Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:59.3424861Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:59.3424969Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:59.3425124Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:59.3425218Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:59.3425476Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:59.3425562Z #define _ENDIAN_H 1 2025-05-07T20:26:59.3425666Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:59.3425767Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:59.3425868Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:59.3425948Z #define __try try 2025-05-07T20:26:59.3426047Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:59.3426139Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:59.3426228Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:59.3426492Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:59.3426663Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:59.3426746Z #define __PIC__ 2 2025-05-07T20:26:59.3426865Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:59.3426985Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:59.3427124Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:59.3427226Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:59.3427322Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:59.3427510Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:59.3427612Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:59.3427714Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:59.3427811Z #define _IO_uid_t __uid_t 2025-05-07T20:26:59.3427908Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:59.3428035Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:59.3428134Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:59.3428633Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:59.3428790Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:59.3428913Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:59.3428996Z #define LONG_BIT 64 2025-05-07T20:26:59.3429161Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:59.3429262Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:59.3429394Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:59.3429495Z #define __fsfilcnt_t_defined 2025-05-07T20:26:59.3429586Z #define __blkcnt_t_defined 2025-05-07T20:26:59.3429855Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:59.3429953Z #define __USE_LARGEFILE 1 2025-05-07T20:26:59.3430052Z #define __cpp_constexpr 201603L 2025-05-07T20:26:59.3430146Z #define CUDART_VERSION 12080 2025-05-07T20:26:59.3430245Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:59.3430347Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:59.3430441Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:59.3430640Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:59.3430737Z #define __lldiv_t_defined 1 2025-05-07T20:26:59.3430830Z #define __SSE2__ 1 2025-05-07T20:26:59.3430913Z #define _IOLBF 1 2025-05-07T20:26:59.3431013Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:59.3431109Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:59.3431237Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:59.3431332Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:59.3431442Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:59.3431541Z #define __INT32_TYPE__ int 2025-05-07T20:26:59.3431634Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:59.3431746Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:59.3431845Z #define __cpp_exceptions 199711L 2025-05-07T20:26:59.3431941Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:59.3432056Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:59.3432148Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:59.3432263Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:59.3432439Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:59.3432534Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:59.3432629Z #define __SWORD_TYPE long int 2025-05-07T20:26:59.3432729Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:59.3432825Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:59.3433166Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:59.3433262Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:59.3433543Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:59.3433645Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:59.3433790Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:59.3433871Z #define _T_SIZE 2025-05-07T20:26:59.3433982Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:59.3434107Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:26:59.3434231Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:59.3434333Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:59.3434548Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:59.3434669Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:59.3434776Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3434866Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:59.3435046Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:59.3435141Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:59.3435244Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:59.3435344Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:59.3435461Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3435543Z #define __PIE__ 2 2025-05-07T20:26:59.3435653Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:59.3435751Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:59.3435941Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:59.3436163Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:59.3436262Z #define __nlink_t_defined 2025-05-07T20:26:59.3436393Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:59.3436504Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:59.3436590Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:59.3436853Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:59.3436975Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:59.3437078Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:59.3437184Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:59.3437277Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:59.3437366Z #define __FILE_defined 1 2025-05-07T20:26:59.3437551Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:59.3437647Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:59.3437747Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:59.3437857Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:59.3437975Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:59.3438088Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:59.3438191Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:59.3438275Z #define __INT16_C(c) c 2025-05-07T20:26:59.3438377Z #define __U32_TYPE unsigned int 2025-05-07T20:26:59.3438475Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:59.3438602Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:59.3438691Z #define __STDC__ 1 2025-05-07T20:26:59.3438787Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:59.3438891Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:59.3438987Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:59.3439137Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:59.3439233Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:59.3439332Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:59.3439428Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:59.3439547Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:59.3439662Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:59.3439758Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:59.3439869Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:59.3439950Z #define stdin stdin 2025-05-07T20:26:59.3440040Z #define __ino64_t_defined 2025-05-07T20:26:59.3440132Z #define STA_CLK 0x8000 
2025-05-07T20:26:59.3440312Z #define __clockid_t_defined 1 2025-05-07T20:26:59.3440466Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:59.3440628Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:59.3440731Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:59.3440838Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:59.3440943Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:59.3441046Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:59.3441251Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:59.3441344Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:59.3442030Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:59.3442142Z #define DOMAIN 1 2025-05-07T20:26:59.3442257Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:59.3442370Z #define __NVCC__ 1 2025-05-07T20:26:59.3442498Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:59.3442638Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:59.3442768Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:59.3442893Z #define __throw_exception_again throw 2025-05-07T20:26:59.3443011Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:59.3443128Z #define __EXCEPTION_H 1 2025-05-07T20:26:59.3443247Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:59.3443376Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:59.3443760Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:59.3443909Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:59.3444042Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:59.3444160Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:59.3444289Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:59.3444393Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:59.3444539Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:59.3444645Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:59.3444764Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:59.3444857Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:59.3444962Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:59.3445065Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:59.3445167Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:59.3445310Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:59.3445404Z #define __useconds_t_defined 2025-05-07T20:26:59.3445504Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:59.3445698Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:59.3445845Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:59.3445931Z #define __SSE_MATH__ 1 2025-05-07T20:26:59.3446028Z #define _IO_wint_t wint_t 2025-05-07T20:26:59.3446124Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:59.3446221Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:59.3446323Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:59.3446437Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:59.3446542Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:59.3446639Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:59.3446723Z #define __USE_ATFILE 1 2025-05-07T20:26:59.3446821Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:26:59.3446916Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:59.3447005Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:59.3447236Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:59.3447338Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:59.3447439Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:59.3447549Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:59.3447661Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:59.3447744Z #define _STDLIB_H 1 2025-05-07T20:26:59.3447888Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:59.3448100Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:59.3448200Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:59.3448328Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:59.3448435Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:59.3448536Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:59.3448720Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:59.3448875Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:59.3448985Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:59.3449105Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:59.3449277Z #define __ldiv_t_defined 1 2025-05-07T20:26:59.3449461Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:59.3449555Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:59.3449731Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:59.3449834Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:59.3449934Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:59.3450041Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:59.3450141Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:59.3450240Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:59.3450328Z #define CUDART_CB 2025-05-07T20:26:59.3450430Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:59.3450553Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:59.3450644Z #define MB_LEN_MAX 16 2025-05-07T20:26:59.3450868Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:59.3450972Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:59.3451102Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:59.3451217Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:59.3451319Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:59.3451468Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:59.3451578Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:59.3451678Z #define _GNU_SOURCE 1 2025-05-07T20:26:59.3451764Z #define __stub_putmsg 2025-05-07T20:26:59.3451848Z #define __CUDACC__ 1 2025-05-07T20:26:59.3451951Z #define __N(msgid) (msgid) 2025-05-07T20:26:59.3452056Z #define __P(args) args 2025-05-07T20:26:59.3452371Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:59.3452503Z #define __cpp_init_captures 201304L 2025-05-07T20:26:59.3452632Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:59.3452752Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:59.3452874Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:59.3452982Z #define __WCHAR_T 2025-05-07T20:26:59.3453104Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:59.3453222Z #define __fsblkcnt_t_defined 2025-05-07T20:26:59.3453366Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:26:59.3453501Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
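For reference, a dump like the one above can be reproduced outside CI. This is a minimal sketch, assuming gcc and nvcc are on PATH; the exact invocation used by this workflow is not shown in the log, so the flags below are illustrative:

  # Host-compiler view: -dM tells the preprocessor to print every active #define.
  echo | gcc -dM -E -x c++ -

  # CUDA view: preprocess an empty .cu file and forward -dM to the host compiler
  # via -Xcompiler. This adds macros such as __NVCC__ and __CUDACC__; device-side
  # macros like __CUDA_ARCH__ may only appear during the device compilation pass.
  touch empty.cu
  nvcc -E -Xcompiler -dM empty.cu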
2025-05-07T20:26:59.3613176Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:01.2597750Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:01.2598180Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:01.2598498Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:01.2598808Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:01.2599139Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:01.3239982Z /usr/bin/nvidia-smi
2025-05-07T20:27:01.3244929Z + nvidia-smi
2025-05-07T20:27:01.3417608Z Wed May  7 20:27:01 2025
2025-05-07T20:27:01.3418030Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.3418538Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:01.3419023Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:01.3419832Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:01.3420358Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:01.3420778Z |                                         |                        |               MIG M. |
2025-05-07T20:27:01.3421116Z |=========================================+========================+======================|
2025-05-07T20:27:01.3590816Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:01.3591255Z |  0%   28C    P8             22W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:01.3591826Z |                                         |                        |                  N/A |
2025-05-07T20:27:01.3592219Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:01.3596006Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.3596429Z | Processes:                                                                              |
2025-05-07T20:27:01.3596866Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:01.3597279Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:01.3597625Z |=========================================================================================|
2025-05-07T20:27:01.3600578Z |  No running processes found                                                             |
2025-05-07T20:27:01.3601053Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:01.6319553Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:01.6370570Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:01.6371115Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:01.6396141Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:01.6396501Z env:
2025-05-07T20:27:01.6396738Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:01.6397037Z   BUILD_ENV: build_binary
2025-05-07T20:27:01.6397284Z   BUILD_TARGET: genai
2025-05-07T20:27:01.6397516Z   BUILD_VARIANT: cuda
2025-05-07T20:27:01.6397747Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:01.6398010Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:01.6398333Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:01.6398664Z ##[endgroup]
2025-05-07T20:27:01.9773707Z ################################################################################
2025-05-07T20:27:01.9774414Z # Install PyTorch (PIP)
2025-05-07T20:27:01.9774871Z #
2025-05-07T20:27:01.9790249Z # [2025-05-07T20:27:01.978Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:01.9790689Z ################################################################################
2025-05-07T20:27:01.9821246Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:02.9809535Z Channels:
2025-05-07T20:27:02.9809879Z  - conda-forge
2025-05-07T20:27:02.9810191Z Platform: linux-64
2025-05-07T20:27:06.2453542Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:06.9634574Z Solving environment: done
2025-05-07T20:27:07.1865294Z ## Package Plan ##
2025-05-07T20:27:07.1865810Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:07.1866290Z   added / updated specs:
2025-05-07T20:27:07.1866532Z     - numpy
2025-05-07T20:27:07.1866822Z The following packages will be downloaded:
2025-05-07T20:27:07.1867153Z     package                    |            build
2025-05-07T20:27:07.1867473Z     ---------------------------|-----------------
2025-05-07T20:27:07.1867850Z     libblas-3.9.0              | 31_h59b9bed_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1868463Z     libcblas-3.9.0             | 31_he106b2a_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1869162Z     libgfortran-15.1.0         | h69a702a_2                 34 KB  conda-forge
2025-05-07T20:27:07.1869717Z     libgfortran5-15.1.0        | hcea5267_2                1.5 MB  conda-forge
2025-05-07T20:27:07.1870573Z     liblapack-3.9.0            | 31_h7ac8fdf_openblas       16 KB  conda-forge
2025-05-07T20:27:07.1871042Z     libopenblas-0.3.29         | pthreads_h94d23a6_0       5.6 MB  conda-forge
2025-05-07T20:27:07.1871491Z     numpy-2.2.5                | py311h5d046bc_0           8.6 MB  conda-forge
2025-05-07T20:27:07.1871871Z     ------------------------------------------------------------
2025-05-07T20:27:07.1872216Z                                                   Total:  15.9 MB
2025-05-07T20:27:07.1872559Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:07.1873003Z   libblas        conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:07.1873498Z   libcblas       conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:07.1874001Z   libgfortran    conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:07.1874508Z   libgfortran5   conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:07.1875040Z   liblapack      conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:07.1875598Z   libopenblas    conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:07.1876311Z   numpy          conda-forge/linux-64::numpy-2.2.5-py311h5d046bc_0
2025-05-07T20:27:07.1876735Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:08.2177840Z Preparing transaction: done
2025-05-07T20:27:08.3180344Z Verifying transaction: done
2025-05-07T20:27:08.4188944Z Executing transaction: done
2025-05-07T20:27:08.5961469Z ################################################################################
2025-05-07T20:27:08.5961879Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:08.5962181Z #
2025-05-07T20:27:08.5979560Z # [2025-05-07T20:27:08.597Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:08.5980035Z ################################################################################
2025-05-07T20:27:08.5995713Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:08.6893570Z [CHECK] Network does not appear to be blocked.
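The [INSTALL] lines that follow show __prepare_pip_arguments turning the requested channel (nightly) and CUDA variant (cuda/12.8.0) into the cu128 index URL. A minimal sketch of that mapping, assuming the real logic in .github/scripts/setup_env.bash may differ in detail:

    # Derive the cu### variant tag and the PyTorch PIP index URL from a
    # CUDA version string, mirroring the values logged below.
    channel="nightly"
    cuda_version="12.8.0"
    variant="cu$(echo "${cuda_version}" | cut -d. -f1-2 | tr -d .)"   # -> cu128
    echo "https://download.pytorch.org/whl/${channel}/${variant}/"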
2025-05-07T20:27:08.6894567Z ################################################################################ 2025-05-07T20:27:08.6895302Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:08.6895631Z # 2025-05-07T20:27:08.6913482Z # [2025-05-07T20:27:08.690Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:08.6913926Z ################################################################################ 2025-05-07T20:27:08.6914140Z 2025-05-07T20:27:08.6934361Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:08.6960113Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:08.6976932Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:08.6977526Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:08.6986198Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:08.6995131Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:08.7016269Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:06.7934620Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:06.7935201Z Collecting torch 2025-05-07T20:28:06.7936087Z Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:06.7937015Z Collecting filelock (from torch) 2025-05-07T20:28:06.7937517Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:06.7938434Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from torch) (4.13.2) 2025-05-07T20:28:06.7939152Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:06.7939699Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:06.7940207Z Collecting networkx (from torch) 2025-05-07T20:28:06.7940700Z Using cached https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:06.7941228Z Collecting jinja2 (from torch) 2025-05-07T20:28:06.7941694Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:06.7942197Z Collecting fsspec (from torch) 2025-05-07T20:28:06.7942680Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:06.7943243Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:28:06.7944057Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7945318Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:06.7946137Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7946946Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:06.7947744Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7948525Z Collecting 
nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:06.7949422Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:06.7950119Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:06.7950820Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7951527Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:06.7952301Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7953247Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:06.7953952Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7954662Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:06.7955374Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7956084Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:06.7956884Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7957684Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:06.7958404Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:06.7959096Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:06.7959850Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:06.7960605Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:06.7961355Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7962127Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:06.7962925Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:06.7963716Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:06.7964496Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:06.7965296Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:06.7966124Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:06.7967380Z Requirement already satisfied: setuptools>=40.8.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T20:28:06.7968216Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:06.7968862Z Using cached 
https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:06.7969394Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:06.7970093Z Using cached https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:28:06.7971130Z Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp311-cp311-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T20:28:06.7972141Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:06.7973201Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:06.7974339Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:06.7975477Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:06.7977122Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:06.7978170Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:06.7979285Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:06.7980326Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:06.7981294Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:06.7982363Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:28:06.7983433Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:06.7984475Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:06.7985588Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:28:06.7986692Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:06.7987815Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:06.7990047Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, 
nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:06.7991636Z 2025-05-07T20:28:06.7993567Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:06.7995686Z 2025-05-07T20:28:09.0236981Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:09.0239377Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:12.4288830Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:15.8668734Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:15.8669531Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:19.1963654Z True 2025-05-07T20:28:19.1963921Z True 2025-05-07T20:28:19.1964033Z 2025-05-07T20:28:19.2588801Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:19.2626114Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:19.2626718Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:19.2639042Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:19.2639390Z env: 2025-05-07T20:28:19.2639615Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:19.2639911Z BUILD_ENV: build_binary 2025-05-07T20:28:19.2640153Z BUILD_TARGET: genai 2025-05-07T20:28:19.2640381Z BUILD_VARIANT: cuda 2025-05-07T20:28:19.2640615Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:19.2640861Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:19.2641161Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:19.2641496Z ##[endgroup] 2025-05-07T20:28:19.6014864Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:19.6016456Z ################################################################################ 2025-05-07T20:28:19.6017075Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:19.6017467Z # 2025-05-07T20:28:19.6032660Z # [2025-05-07T20:28:19.602Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:19.6033160Z ################################################################################ 2025-05-07T20:28:19.6033377Z 2025-05-07T20:28:19.6048197Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:19.6981790Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:19.6990266Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:19.6991227Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:19.6991637Z 2025-05-07T20:28:19.8028662Z 2025-05-07T20:28:19.8029547Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:19.8054985Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:25.5706496Z Collecting environment information... 
2025-05-07T20:28:25.5706930Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:25.5707252Z Is debug build: False 2025-05-07T20:28:25.5707512Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:25.5707791Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:25.5707971Z 2025-05-07T20:28:25.5708078Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:25.5708407Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:25.5708724Z Clang version: Could not collect 2025-05-07T20:28:25.5709062Z CMake version: Could not collect 2025-05-07T20:28:25.5709341Z Libc version: glibc-2.34 2025-05-07T20:28:25.5709497Z 2025-05-07T20:28:25.5709797Z Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:25.5710417Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:25.5710837Z Is CUDA available: True 2025-05-07T20:28:25.5711103Z CUDA runtime version: 12.8.61 2025-05-07T20:28:25.5711373Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:25.5711683Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:25.5712284Z Nvidia driver version: 570.133.07 2025-05-07T20:28:25.5712560Z cuDNN version: Could not collect 2025-05-07T20:28:25.5712834Z HIP runtime version: N/A 2025-05-07T20:28:25.5713092Z MIOpen runtime version: N/A 2025-05-07T20:28:25.5713350Z Is XNNPACK available: True 2025-05-07T20:28:25.5713514Z 2025-05-07T20:28:25.5713593Z CPU: 2025-05-07T20:28:25.5713814Z Architecture: x86_64 2025-05-07T20:28:25.5714151Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:25.5714533Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:25.5714924Z Byte Order: Little Endian 2025-05-07T20:28:25.5715245Z CPU(s): 16 2025-05-07T20:28:25.5715536Z On-line CPU(s) list: 0-15 2025-05-07T20:28:25.5716050Z Vendor ID: AuthenticAMD 2025-05-07T20:28:25.5716397Z Model name: AMD EPYC 7R32 2025-05-07T20:28:25.5716718Z CPU family: 23 2025-05-07T20:28:25.5717017Z Model: 49 2025-05-07T20:28:25.5717305Z Thread(s) per core: 2 2025-05-07T20:28:25.5717592Z Core(s) per socket: 8 2025-05-07T20:28:25.5717879Z Socket(s): 1 2025-05-07T20:28:25.5718165Z Stepping: 0 2025-05-07T20:28:25.5718466Z BogoMIPS: 5599.99 2025-05-07T20:28:25.5720488Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:25.5722516Z Hypervisor vendor: KVM 2025-05-07T20:28:25.5722835Z Virtualization type: full 2025-05-07T20:28:25.5723177Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:25.5723541Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:25.5723911Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:25.5724265Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:25.5724590Z NUMA node(s): 1 2025-05-07T20:28:25.5724879Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:25.5725250Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:25.5725628Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:25.5725986Z Vulnerability L1tf: Not affected 2025-05-07T20:28:25.5726329Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:25.5726684Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:25.5727043Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:25.5727401Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:25.5727938Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:25.5728699Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:25.5729238Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:25.5729908Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:25.5730755Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:25.5731427Z Vulnerability Srbds: Not affected 2025-05-07T20:28:25.5731918Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:25.5732147Z 2025-05-07T20:28:25.5732252Z Versions of relevant libraries: 2025-05-07T20:28:25.5732520Z [pip3] numpy==2.2.5 2025-05-07T20:28:25.5732766Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:25.5733070Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:25.5733383Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:25.5733697Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:25.5734006Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:25.5734297Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:25.5734589Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:25.5734890Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:25.5735192Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:25.5735649Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:25.5735954Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:25.5736239Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:25.5736548Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:25.5736838Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:25.5737137Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:25.5737508Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5737993Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5738495Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5739012Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:25.5739544Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5740073Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:25.5740548Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5741014Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:25.5741500Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:25.5741994Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:25.5742461Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5742931Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:25.5743390Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5743835Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5744309Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:25.5744785Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:25.5745243Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:25.5745705Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5746169Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5746626Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:25.5747079Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5747546Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:25.5748018Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5748496Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:25.5749030Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5749519Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:25.5749998Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:25.5750563Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:25.5751019Z [conda] numpy 2.2.5 py311h5d046bc_0 conda-forge 2025-05-07T20:28:25.5751477Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:25.5751975Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:25.5752463Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:25.5752961Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:25.5753448Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:25.5753998Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:25.5754475Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:25.5754959Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:25.5755444Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:25.5755932Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:25.5756412Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:25.5756885Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:25.5757353Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:25.5757823Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:25.5758277Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:25.5758543Z 2025-05-07T20:28:25.6461589Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:25.6462251Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:25.6474156Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:25.6474501Z env: 2025-05-07T20:28:25.6474730Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:25.6475024Z BUILD_ENV: build_binary 2025-05-07T20:28:25.6475272Z BUILD_TARGET: genai 2025-05-07T20:28:25.6475502Z BUILD_VARIANT: cuda 2025-05-07T20:28:25.6475739Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:25.6475996Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:25.6476296Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:25.6476628Z ##[endgroup] 2025-05-07T20:28:25.9878594Z ################################################################################ 2025-05-07T20:28:25.9878983Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:25.9879229Z # 2025-05-07T20:28:25.9895081Z # [2025-05-07T20:28:25.989Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:25.9895484Z ################################################################################ 2025-05-07T20:28:25.9895706Z 2025-05-07T20:28:25.9910519Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:26.0777919Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:26.0798012Z [BUILD] Running git submodules update ... 2025-05-07T20:28:26.0818571Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:26.1187007Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:26.1187668Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:26.1188305Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:26.1188861Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:26.1189415Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:26.1189868Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:26.1190273Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:26.1222837Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:26.1774816Z [BUILD] Installing other build dependencies ... 
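The next step installs the build dependencies pinned in requirements.txt. Once it completes, a quick hedged spot-check (not part of the logged job; the module names are assumptions based on the packages listed below) would be:

    # Verify that the key build dependencies resolved into the conda env.
    conda run -n build_binary python -c \
        "import skbuild, ninja, cmake, yaml; print('build deps importable')"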
2025-05-07T20:28:26.1796946Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:28.5437980Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:28.5452302Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:28.5793515Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:28.5805857Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:28.7238216Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:28.7252981Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:28.7641054Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:28.7654453Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:28.9923744Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:28.9937951Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:29.0023211Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:29.0025643Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:29.0480751Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:29.0493524Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:29.0506362Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:29.0826733Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:29.0839198Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:29.1371817Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:29.1385075Z Using cached PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:29.1709963Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:29.1721708Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:29.1768737Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:29.2125034Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:29.2137672Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:29.2462015Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:29.2473240Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:29.2865102Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:29.2877351Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:29.3272266Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:29.3283751Z Using cached packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:29.3569733Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:29.3581299Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:29.3912809Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:29.3924832Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:29.4297226Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:29.4309802Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:29.4334715Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:29.4640339Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:29.4652304Z Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:29.4666593Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:29.4949767Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:29.4961674Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:29.4982098Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:29.5393878Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:29.5405714Z Using cached mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:29.5437473Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:29.5449528Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:29.5461777Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:29.5680931Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:29.5693495Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:29.5709448Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:29.5721491Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:29.5736867Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:29.5749063Z Using cached PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB) 2025-05-07T20:28:29.5765953Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:29.5778390Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:29.5790201Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:29.5802472Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:29.5817726Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:29.5831016Z Using cached packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:29.5843576Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:29.5855653Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:29.5867413Z Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:29.5879448Z Using 
cached mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:29.7222367Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:32.0352978Z 2025-05-07T20:28:32.0380830Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:32.2123727Z ################################################################################ 2025-05-07T20:28:32.2124071Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:32.2124523Z # 2025-05-07T20:28:32.2143971Z # [2025-05-07T20:28:32.214Z] + install_triton_pip build_binary 2025-05-07T20:28:32.2144351Z ################################################################################ 2025-05-07T20:28:32.2144562Z 2025-05-07T20:28:32.2144786Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:32.2154166Z ################################################################################ 2025-05-07T20:28:32.2154584Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:32.2154913Z # 2025-05-07T20:28:32.2161771Z # [2025-05-07T20:28:32.215Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:32.2162295Z ################################################################################ 2025-05-07T20:28:32.2162516Z 2025-05-07T20:28:32.2180858Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:32.3117139Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:32.3117497Z ################################################################################ 2025-05-07T20:28:32.3117840Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:32.3118124Z # 2025-05-07T20:28:32.3136629Z # [2025-05-07T20:28:32.313Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:32.3137106Z ################################################################################ 2025-05-07T20:28:32.3137326Z 2025-05-07T20:28:32.3187245Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:32.3203360Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:32.3203973Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:32.3212631Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:32.3221735Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:32.3242682Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.0806651Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
2025-05-07T20:28:37.0807853Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:37.0808484Z 2025-05-07T20:28:37.0808694Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:37.0809108Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:37.0809932Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:37.0811138Z Using cached https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:37.0811911Z Installing collected packages: pytorch-triton 2025-05-07T20:28:37.0812257Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:37.0812642Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:37.0813053Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:37.0813472Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:37.0813907Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:37.0814162Z 2025-05-07T20:28:39.2932409Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:39.2936258Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:28:41.4425229Z ################################################################################ 2025-05-07T20:28:41.4426227Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:28:41.4426743Z ################################################################################ 2025-05-07T20:28:41.4427043Z 2025-05-07T20:28:43.5094813Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:28:45.6223673Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:28:45.6228054Z [BUILD] Successfully ran git submodules update 2025-05-07T20:28:45.6272978Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:45.6273460Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:28:45.6285277Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:45.6285625Z env: 2025-05-07T20:28:45.6285847Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:45.6286154Z BUILD_ENV: build_binary 2025-05-07T20:28:45.6286402Z BUILD_TARGET: genai 2025-05-07T20:28:45.6286626Z BUILD_VARIANT: cuda 2025-05-07T20:28:45.6286869Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:45.6287146Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:45.6287449Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:45.6287779Z ##[endgroup] 2025-05-07T20:28:45.9629020Z ################################################################################ 2025-05-07T20:28:45.9629413Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:28:45.9629681Z # 2025-05-07T20:28:45.9649264Z # [2025-05-07T20:28:45.964Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9649902Z ################################################################################ 2025-05-07T20:28:45.9650123Z 2025-05-07T20:28:45.9650482Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9651172Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9651515Z 2025-05-07T20:28:45.9801070Z c326345df354c6141153099e3e50ba8d6de34fcb fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9803916Z 2025-05-07T20:28:45.9804414Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9804775Z 2025-05-07T20:28:45.9972989Z 9f4154b2f6c41ae40824604f2980de212f6e65550128fe52cae1c9c75e71312b fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9975438Z 2025-05-07T20:28:45.9975810Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:45.9976154Z 2025-05-07T20:28:46.0304705Z 1c01cd21bdf738277ab20dc3f0582ce3 fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:46.0307032Z 2025-05-07T20:28:46.0316687Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl ... 2025-05-07T20:28:46.0338823Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:48.8019364Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp311-cp311-manylinux_2_28_x86_64.whl 2025-05-07T20:28:48.8020374Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:28:48.8021213Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:28:48.8021663Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:28:48.8021936Z 2025-05-07T20:28:55.6336692Z ################################################################################ 2025-05-07T20:28:55.6337127Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:28:55.6337520Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:55.6337949Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:28:55.6338252Z [CHECK] 2025-05-07T20:28:55.6338576Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:28:55.6339459Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:28:55.6339852Z ################################################################################ 2025-05-07T20:28:55.6340065Z 2025-05-07T20:28:55.6340187Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:28:59.5479198Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:03.4698316Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.3979634Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:07.3982529Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:19.1487861Z ################################################################################ 2025-05-07T20:29:19.1488275Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:19.1488621Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:19.1488972Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:19.1489351Z ################################################################################ 2025-05-07T20:29:19.1489570Z 2025-05-07T20:29:26.9951077Z ################################################################################ 2025-05-07T20:29:26.9951558Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:26.9953133Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:26.9955187Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:26.9955712Z ################################################################################ 2025-05-07T20:29:26.9955945Z 2025-05-07T20:29:26.9956098Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:30.9208158Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:34.8392907Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:38.8872111Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:29:42.8069664Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:29:42.8073862Z [INSTALL] Check for operator registrations ... 
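The operator-registration check that follows imports fbgemm_gpu and resolves each operator on torch.ops. A hedged one-liner equivalent (the operator name is taken from the log; the helper in setup_env.bash may probe more than this):

    # Resolving the op raises AttributeError if the registration is missing.
    conda run -n build_binary python -c \
        "import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)"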
2025-05-07T20:29:46.6524445Z fbgemm.nccl_init 2025-05-07T20:29:46.6524629Z 2025-05-07T20:29:46.7142596Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:29:50.5579860Z fbgemm.gqa_attn_splitk 2025-05-07T20:29:50.5580074Z 2025-05-07T20:29:50.6201289Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:29:54.4748396Z fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.4748627Z 2025-05-07T20:29:54.5365298Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:29:54.5365913Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:29:54.5402579Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5403049Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:29:54.5415043Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:54.5415420Z env: 2025-05-07T20:29:54.5415655Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:54.5415949Z BUILD_ENV: build_binary 2025-05-07T20:29:54.5416196Z BUILD_TARGET: genai 2025-05-07T20:29:54.5416428Z BUILD_VARIANT: cuda 2025-05-07T20:29:54.5416660Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:54.5416921Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:54.5417229Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:54.5417756Z ##[endgroup] 2025-05-07T20:29:54.8770361Z ################################################################################ 2025-05-07T20:29:54.8770741Z # Test All FBGEMM-GPU Modules 2025-05-07T20:29:54.8771009Z # 2025-05-07T20:29:54.8787202Z # [2025-05-07T20:29:54.878Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:29:54.8787615Z ################################################################################ 2025-05-07T20:29:54.8787830Z 2025-05-07T20:30:02.7393642Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:02.7394760Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:02.7395541Z [TEST] Determined the test directories: 2025-05-07T20:30:02.7396155Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:02.7396753Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:02.7397344Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:02.7397711Z 2025-05-07T20:30:02.7400071Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:02.7406325Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:02.7406902Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:02.7407183Z 2025-05-07T20:30:03.1677895Z 2025-05-07T20:30:03.1678314Z [TEST] Installing PyTest ... 
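The [EXEC] [ATTEMPT 0/3] markers on the records below come from a retry wrapper in the setup_env.bash prelude. A hypothetical Python rendering of the same idea (the real helper is a bash function, so the name, backoff, and attempt count here are illustrative only):

import subprocess
import time

def run_with_retries(cmd: list[str], max_attempts: int = 3, delay_s: float = 30.0) -> None:
    # Prints an "[EXEC] [ATTEMPT i/3]" marker before each try, like the log lines below.
    for attempt in range(max_attempts):
        print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
        try:
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(delay_s)

run_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                  "--override-channels", "-y", "pytest", "expecttest"])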
2025-05-07T20:30:03.1701447Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:04.2812978Z Channels:
2025-05-07T20:30:04.2813236Z  - conda-forge
2025-05-07T20:30:04.2813464Z Platform: linux-64
2025-05-07T20:30:07.5677983Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:08.7148052Z Solving environment: done
2025-05-07T20:30:08.9467162Z ## Package Plan ##
2025-05-07T20:30:08.9467672Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:08.9468208Z   added / updated specs:
2025-05-07T20:30:08.9468533Z     - expecttest
2025-05-07T20:30:08.9468810Z     - pytest
2025-05-07T20:30:08.9469170Z The following packages will be downloaded:
2025-05-07T20:30:08.9469516Z     package                    |            build
2025-05-07T20:30:08.9469838Z     ---------------------------|-----------------
2025-05-07T20:30:08.9470232Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:30:08.9470692Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:30:08.9471159Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:30:08.9471596Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:30:08.9472021Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:30:08.9472456Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:30:08.9472866Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:30:08.9473746Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:30:08.9474397Z     ------------------------------------------------------------
2025-05-07T20:30:08.9474840Z                                            Total:         428 KB
2025-05-07T20:30:08.9475241Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:08.9475839Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:08.9476438Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:08.9477033Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:08.9477648Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:08.9478374Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:08.9478919Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:08.9489223Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:08.9489665Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:08.9490079Z Downloading and Extracting Packages: ...working...
pytest-8.3.5         | 254 KB | ########## | 100%
packaging-25.0       |  61 KB | ########## | 100%
colorama-0.4.6       |  26 KB | ########## | 100%
pluggy-1.5.0         |  23 KB | ########## | 100%
exceptiongroup-1.2.2 |  20 KB | ########## | 100%
tomli-2.2.1          |  19 KB | ########## | 100%
expecttest-0.3.0     |  14 KB | ########## | 100%
iniconfig-2.0.0      |  11 KB | ########## | 100%
2025-05-07T20:30:09.3567374Z done
2025-05-07T20:30:09.4568635Z Preparing transaction: done
2025-05-07T20:30:09.5573849Z Verifying transaction: done
2025-05-07T20:31:11.3606171Z Executing transaction: done
2025-05-07T20:30:11.4879730Z [TEST] Checking imports ...
2025-05-07T20:30:15.3987087Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:15.3999704Z [TEST] Setting feature flags ...
2025-05-07T20:30:15.4000153Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:15.4000514Z 2025-05-07T20:30:15.8246650Z 2025-05-07T20:30:15.8247221Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:15.8248558Z ################################################################################ 2025-05-07T20:30:15.8249023Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:15.8249333Z # 2025-05-07T20:30:15.8269033Z # [2025-05-07T20:30:15.826Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:15.8269460Z ################################################################################ 2025-05-07T20:30:15.8269673Z 2025-05-07T20:30:15.8277139Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:15.8306019Z ./attention/gqa_test.py 2025-05-07T20:30:15.8306330Z ./coalesce/coalesce_test.py 2025-05-07T20:30:15.8306725Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:15.8307020Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:15.8307317Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:15.8307587Z ./moe/activation_test.py 2025-05-07T20:30:15.8307851Z ./moe/gather_scatter_test.py 2025-05-07T20:30:15.8308101Z ./moe/layers_test.py 2025-05-07T20:30:15.8308341Z ./moe/shuffling_test.py 2025-05-07T20:30:15.8308589Z ./quantize/quantize_test.py 2025-05-07T20:30:15.8308750Z 2025-05-07T20:30:15.8308874Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:15.8309161Z 2025-05-07T20:30:15.8327068Z ################################################################################ 2025-05-07T20:30:15.8342479Z # [2025-05-07T20:30:15.833Z] Run Python Test Suite: 2025-05-07T20:30:15.8342847Z # ./attention/gqa_test.py 2025-05-07T20:30:15.8343187Z ################################################################################ 2025-05-07T20:30:15.8366400Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:15.8367005Z 2025-05-07T20:30:18.3484775Z ============================= test session starts ============================== 2025-05-07T20:30:18.3485464Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:18.3485980Z cachedir: .pytest_cache 2025-05-07T20:30:18.3486875Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:18.3487610Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:18.3488012Z plugins: hypothesis-6.131.14 2025-05-07T20:30:20.0811491Z collecting ... 
collected 2 items

2025-05-07T20:30:57.9416794Z attention/gqa_test.py::Int4GQATest::test_gqa
Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=23, MAX_T=33, N_H_L=68)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=4, N_H_L=1)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=77, MAX_T=52, N_H_L=67)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=57, MAX_T=45, N_H_L=120)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=52, MAX_T=42, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=77, MAX_T=95, N_H_L=53)
Trying example: test_gqa(int4_kv=True, num_groups=4, B=113, MAX_T=48, N_H_L=96)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=51, MAX_T=61, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=113, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=17, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=65, MAX_T=65, N_H_L=65)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=108, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=14, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=14)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=6, MAX_T=6, N_H_L=6)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=70, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=78, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=78)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=94, MAX_T=94, N_H_L=94)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=41, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=126)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=105, MAX_T=105, N_H_L=105)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=95, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=114, N_H_L=43)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=43, MAX_T=43, N_H_L=43)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=21, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=38, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=38, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=False, num_groups=1, B=42, MAX_T=42, N_H_L=42)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=74, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=20, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=20, N_H_L=15)
Trying example: test_gqa(int4_kv=True, num_groups=1, B=15, MAX_T=15, N_H_L=15)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=104, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=69, MAX_T=117, N_H_L=69)
Trying example: test_gqa(int4_kv=False, num_groups=4, B=117, MAX_T=69, N_H_L=69)
2025-05-07T20:30:57.9511244Z PASSED
2025-05-07T20:30:57.9726961Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
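The long "Trying example" listing above is Hypothesis replaying its 'ci' profile (database=None, deadline=None, print_blob=True, derandomize=True, as shown in the session header), so the example sequence is deterministic across runs. A skeletal reconstruction of what such a test looks like; the strategies and bounds here are guesses for illustration, not the actual ones in gqa_test.py:

from hypothesis import HealthCheck, given, settings, strategies as st

settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,  # fixed example order, matching the listing above
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")

@given(
    int4_kv=st.booleans(),
    num_groups=st.sampled_from([1, 4]),            # guessed; real ranges live in gqa_test.py
    B=st.integers(min_value=1, max_value=128),
    MAX_T=st.integers(min_value=4, max_value=128),
    N_H_L=st.integers(min_value=1, max_value=128),
)
def test_gqa(int4_kv, num_groups, B, MAX_T, N_H_L):
    ...  # build K/V caches with these shapes and compare gqa_attn_splitk against a reference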
2025-05-07T20:30:57.9727286Z 2025-05-07T20:30:57.9727436Z =========================== short test summary info ============================ 2025-05-07T20:30:57.9728710Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when CUDA is not available or xformers is not available 2025-05-07T20:30:57.9730470Z ======================== 1 passed, 1 skipped in 40.11s ========================= 2025-05-07T20:30:58.6306098Z 2025-05-07T20:30:58.6306710Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:30:58.6327356Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds 2025-05-07T20:30:58.6327648Z 2025-05-07T20:30:58.6327653Z 2025-05-07T20:30:58.6327657Z 2025-05-07T20:30:58.6327660Z 2025-05-07T20:30:58.6348252Z ################################################################################ 2025-05-07T20:30:58.6363747Z # [2025-05-07T20:30:58.636Z] Run Python Test Suite: 2025-05-07T20:30:58.6364086Z # ./coalesce/coalesce_test.py 2025-05-07T20:30:58.6364379Z ################################################################################ 2025-05-07T20:30:58.6389941Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:30:58.6390573Z 2025-05-07T20:31:00.7790120Z ============================= test session starts ============================== 2025-05-07T20:31:00.7790744Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:00.7791270Z cachedir: .pytest_cache 2025-05-07T20:31:00.7791847Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:00.7792573Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:00.7792976Z plugins: hypothesis-6.131.14 2025-05-07T20:31:02.4494177Z collecting ... 
collected 1 item 2025-05-07T20:31:02.4494384Z 2025-05-07T20:31:03.1773600Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:03.1773961Z 2025-05-07T20:31:03.1774112Z ============================== 1 passed in 2.52s =============================== 2025-05-07T20:31:03.8179847Z 2025-05-07T20:31:03.8180494Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:03.8202568Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:03.8202860Z 2025-05-07T20:31:03.8202864Z 2025-05-07T20:31:03.8202869Z 2025-05-07T20:31:03.8202920Z 2025-05-07T20:31:03.8223104Z ################################################################################ 2025-05-07T20:31:03.8238587Z # [2025-05-07T20:31:03.823Z] Run Python Test Suite: 2025-05-07T20:31:03.8238935Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.8239228Z ################################################################################ 2025-05-07T20:31:03.8265336Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:03.8266230Z 2025-05-07T20:31:05.9603918Z ============================= test session starts ============================== 2025-05-07T20:31:05.9604581Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:05.9605107Z cachedir: .pytest_cache 2025-05-07T20:31:05.9605691Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:05.9606426Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:05.9606828Z plugins: hypothesis-6.131.14 2025-05-07T20:31:07.6726568Z collecting ... 
collected 5 items 2025-05-07T20:31:07.6726773Z 2025-05-07T20:31:07.6737130Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:07.6755944Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:07.6763315Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:07.6770241Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:07.6785553Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:07.6785902Z 2025-05-07T20:31:07.6786066Z =========================== short test summary info ============================ 2025-05-07T20:31:07.6786725Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6787644Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6788558Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6789513Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6790582Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:07.6791225Z ============================== 5 skipped in 1.84s ============================== 2025-05-07T20:31:08.2606149Z 2025-05-07T20:31:08.2606928Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:08.2627957Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:31:08.2628377Z 2025-05-07T20:31:08.2628381Z 2025-05-07T20:31:08.2628385Z 2025-05-07T20:31:08.2628388Z 2025-05-07T20:31:08.2648649Z ################################################################################ 2025-05-07T20:31:08.2663688Z # [2025-05-07T20:31:08.266Z] Run Python Test Suite: 2025-05-07T20:31:08.2664041Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.2664394Z ################################################################################ 2025-05-07T20:31:08.2689922Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:08.2690579Z 2025-05-07T20:31:10.4029627Z ============================= test session starts ============================== 2025-05-07T20:31:10.4030265Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:10.4030785Z cachedir: .pytest_cache 2025-05-07T20:31:10.4031353Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:10.4032078Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:10.4032496Z plugins: hypothesis-6.131.14 2025-05-07T20:31:12.2475752Z collecting ... 
collected 2 items 2025-05-07T20:31:12.2476185Z 2025-05-07T20:31:12.2484765Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:12.2499184Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:12.2499612Z 2025-05-07T20:31:12.2499771Z =========================== short test summary info ============================ 2025-05-07T20:31:12.2500385Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.2501210Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:12.2501809Z ============================== 2 skipped in 1.96s ============================== 2025-05-07T20:31:12.8490878Z 2025-05-07T20:31:12.8491548Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:12.8510718Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:31:12.8511044Z 2025-05-07T20:31:12.8511048Z 2025-05-07T20:31:12.8511061Z 2025-05-07T20:31:12.8511065Z 2025-05-07T20:31:12.8533614Z ################################################################################ 2025-05-07T20:31:12.8548446Z # [2025-05-07T20:31:12.854Z] Run Python Test Suite: 2025-05-07T20:31:12.8548774Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:12.8549110Z ################################################################################ 2025-05-07T20:31:12.8573600Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:12.8574221Z 2025-05-07T20:31:14.9901882Z ============================= test session starts ============================== 2025-05-07T20:31:14.9902514Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:14.9903358Z cachedir: .pytest_cache 2025-05-07T20:31:14.9903936Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:14.9904663Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:14.9905072Z plugins: hypothesis-6.131.14 2025-05-07T20:31:16.7573151Z collecting ... collected 4 items 2025-05-07T20:31:16.7573511Z 2025-05-07T20:31:19.2908124Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
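The SKIPPED entries in these suites come from guards evaluated when the test module loads: "Skip when CUDA is not available", "not enough GPUs; these tests require at least two GPUs", and "Skip when no Hopper GPU is available". A representative sketch of such guards; the conditions are paraphrased from the skip messages, not copied from the FBGEMM test sources:

import unittest
import torch

def hopper_available() -> bool:
    # Hopper is compute capability 9.x; a pre-Hopper part fails this check,
    # which matches the Hopper-only skips recorded above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability(0) >= (9, 0)

class ExampleGuardedTests(unittest.TestCase):
    @unittest.skipIf(not hopper_available(),
                     "Skip when no Hopper GPU is available. This test is only for Hopper GPU.")
    def test_hopper_only_kernel(self) -> None:
        ...

    @unittest.skipIf(torch.cuda.device_count() < 2,
                     "Skip when there are not enough GPUs; these tests require at least two GPUs")
    def test_multi_gpu_collective(self) -> None:
        ...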
2025-05-07T20:31:19.3031460Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:19.3178802Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:19.3303905Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:19.3304263Z 2025-05-07T20:31:19.3304414Z =========================== short test summary info ============================ 2025-05-07T20:31:19.3305142Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:19.3306285Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/unittest/case.py:153: Skip when xformers is not available 2025-05-07T20:31:19.3307033Z ============================== 4 skipped in 4.46s ============================== 2025-05-07T20:31:21.4329531Z 2025-05-07T20:31:21.4330521Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:21.4349942Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:31:21.4350350Z 2025-05-07T20:31:21.4350356Z 2025-05-07T20:31:21.4350362Z 2025-05-07T20:31:21.4350414Z 2025-05-07T20:31:21.4372571Z ################################################################################ 2025-05-07T20:31:21.4387814Z # [2025-05-07T20:31:21.438Z] Run Python Test Suite: 2025-05-07T20:31:21.4388317Z # ./moe/activation_test.py 2025-05-07T20:31:21.4388702Z ################################################################################ 2025-05-07T20:31:21.4412512Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:21.4413236Z 2025-05-07T20:31:23.5802156Z ============================= test session starts ============================== 2025-05-07T20:31:23.5802822Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:23.5803362Z cachedir: .pytest_cache 2025-05-07T20:31:23.5803935Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:23.5804669Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:23.5805084Z plugins: hypothesis-6.131.14 2025-05-07T20:31:25.2101139Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:25.3620934Z collecting ... 
collected 2 items

2025-05-07T20:31:30.5472053Z moe/activation_test.py::ActivationTests::test_silu_mul
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=4096, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=1, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=128, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=1, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=2048, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=4096, D=7168, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=128, D=7168, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=True)
Trying example: test_silu_mul(T=2048, D=5120, contiguous=False, compiled=True)
Trying example: test_silu_mul(T=16384, D=5120, contiguous=True, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=False, compiled=False)
Trying example: test_silu_mul(T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:31:30.5553907Z PASSED
2025-05-07T20:31:30.6171278Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:30.6173400Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last):
2025-05-07T20:31:30.6176117Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:30.6178956Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:30.6180322Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6181637Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:30.6183194Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:30.6184193Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6185435Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:30.6186821Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:30.6187994Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6189334Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:30.6190590Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
2025-05-07T20:31:30.6191822Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:30.6193043Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
2025-05-07T20:31:30.6193878Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:30.6194919Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:30.6195950Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
2025-05-07T20:31:30.6196751Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^
2025-05-07T20:31:30.6197952Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:30.6199251Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:30.6200383Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:30.6201444Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
2025-05-07T20:31:30.6202624Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:30.6203985Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:30.6205055Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:30.6206067Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:30.6206809Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
2025-05-07T20:31:30.6207843Z W0507 20:31:30.615000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture.
[... identical identify_mutated_tensors warning traceback repeated 3 more times at W0507 20:31:30.632, 20:31:30.670, and 20:31:30.674; repetitions elided ...]
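Every warning above shares the root cause stated on its final line: Triton's fp8e4nv is the e4m3 format behind torch.float8_e4m3fn, and its NVIDIA code generation requires compute capability 8.9 or newer (Ada/Hopper); on older GPUs only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal guard sketch follows for skipping fp8 cases on such devices; device_supports_fp8e4nv and requires_fp8 are hypothetical names, not part of activation_test.py:

# Hypothetical helper, not from the FBGEMM test suite: detect whether the
# current CUDA device can compile Triton kernels that use fp8e4nv.
import unittest

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate fp8 tests so unsupported GPUs skip instead of erroring.
requires_fp8 = unittest.skipUnless(
    device_supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
)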
2025-05-07T20:31:31.0850482Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:31:31.0851822Z     self=,
2025-05-07T20:31:31.0852633Z     T=1,
2025-05-07T20:31:31.0853018Z     D=5120,
2025-05-07T20:31:31.0853400Z     scale_ub=None,
2025-05-07T20:31:31.0853832Z     contiguous=True,
2025-05-07T20:31:31.0854281Z     compiled=True,
2025-05-07T20:31:31.0854683Z )
2025-05-07T20:31:31.0855325Z self = 
2025-05-07T20:31:31.0856320Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:31.0856830Z 
2025-05-07T20:31:31.0857005Z     @given(
2025-05-07T20:31:31.0857464Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:31.0858102Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:31.0858714Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:31.0859356Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:31.0859899Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:31.0860186Z     )
2025-05-07T20:31:31.0860531Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:31.0860974Z     def test_silu_mul_quant(
2025-05-07T20:31:31.0861223Z         self,
2025-05-07T20:31:31.0861418Z         T: int,
2025-05-07T20:31:31.0861624Z         D: int,
2025-05-07T20:31:31.0861849Z         scale_ub: Optional[float],
2025-05-07T20:31:31.0862120Z         contiguous: bool,
2025-05-07T20:31:31.0862362Z         compiled: bool,
2025-05-07T20:31:31.0862593Z     ) -> None:
2025-05-07T20:31:31.0862815Z         torch.manual_seed(2025)
2025-05-07T20:31:31.0863056Z 
2025-05-07T20:31:31.0863336Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:31.0864036Z 
2025-05-07T20:31:31.0864234Z         x_sign = torch.sign(x)
2025-05-07T20:31:31.0864533Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:31.0864849Z         x = x_sign * x_clamp
2025-05-07T20:31:31.0865086Z         x0 = x[:, :D]
2025-05-07T20:31:31.0865317Z         x1 = x[:, D:]
2025-05-07T20:31:31.0865526Z 
2025-05-07T20:31:31.0865711Z         if contiguous:
2025-05-07T20:31:31.0865952Z             x0 = x0.contiguous()
2025-05-07T20:31:31.0866214Z             x1 = x1.contiguous()
2025-05-07T20:31:31.0866453Z 
2025-05-07T20:31:31.0866660Z         if scale_ub is not None:
2025-05-07T20:31:31.0866935Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:31.0867434Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:31.0867753Z             )
2025-05-07T20:31:31.0867955Z         else:
2025-05-07T20:31:31.0868173Z             scale_ub_tensor = None
2025-05-07T20:31:31.0868423Z 
2025-05-07T20:31:31.0868665Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:31.0868984Z             op = silu_mul_quant
2025-05-07T20:31:31.0869294Z             if compiled:
2025-05-07T20:31:31.0869546Z                 op = torch.compile(op)
2025-05-07T20:31:31.0869848Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:31.0870124Z 
2025-05-07T20:31:31.0870324Z         y_fp8, y_scale = fn()
2025-05-07T20:31:31.0870613Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:31.0870901Z 
2025-05-07T20:31:31.0871150Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:31.0871485Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:31.0871776Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:31.0872099Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:31.0872467Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:31.0872778Z 
2025-05-07T20:31:31.0872984Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:31.0873187Z 
2025-05-07T20:31:31.0873289Z moe/activation_test.py:126: 
2025-05-07T20:31:31.0873593Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:31.0873940Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:31.0874273Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:31.0875068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:31.0875817Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:31.0876364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:31.0877060Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:31.0877743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:31.0878477Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:31.0879241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:31.0879993Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:31.0880718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:31.0881363Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:31.0881971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:31.0882492Z     fn()
2025-05-07T20:31:31.0882996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:31.0883583Z     self.fn.run(
2025-05-07T20:31:31.0884145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:31.0884671Z     kernel = self.compile(
2025-05-07T20:31:31.0885211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:31.0885864Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:31.0886265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:31.0886493Z 
2025-05-07T20:31:31.0886701Z self = 
2025-05-07T20:31:31.0887787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:31.0889286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9943240>}
2025-05-07T20:31:31.0890672Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:31.0891687Z context = 
2025-05-07T20:31:31.0891975Z 
2025-05-07T20:31:31.0892142Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:31.0892668Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:31.0893136Z                            module_map=module_map)
2025-05-07T20:31:31.0893509Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:31.0893866Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:31.0894137Z E       ^
2025-05-07T20:31:31.0894601Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:31.0895054Z 
2025-05-07T20:31:31.0895470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:31.0895992Z 
2025-05-07T20:31:31.0896099Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:31.0896519Z     self=,
2025-05-07T20:31:31.0896915Z     T=2048,
2025-05-07T20:31:31.0897112Z     D=5120,
2025-05-07T20:31:31.0897311Z     scale_ub=1200.0,
2025-05-07T20:31:31.0897533Z     contiguous=True,
2025-05-07T20:31:31.0897758Z     compiled=False,
2025-05-07T20:31:31.0897976Z )
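Hypothesis prints the sampled arguments for each example, so the failing case above can be replayed outside the test harness. A standalone sketch under stated assumptions (the import path is taken from the traceback's installed fbgemm_gpu build; on a GPU without fp8e4nv support this raises the same CompilationError):

# Standalone repro sketch for the example above (T=2048, D=5120,
# scale_ub=1200.0, contiguous=True, compiled=False).
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 2048, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :D].contiguous()  # contiguous=True branch of the test
x1 = x[:, D:].contiguous()
scale_ub_tensor = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
# Raises triton.compiler.errors.CompilationError on GPUs that lack fp8e4nv.
y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub_tensor)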
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:31.4412202Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:31.4413047Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:31.4414075Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:31.4415097Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:31.4415898Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:31.4417115Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:31.4418421Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:31.4419541Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:31.4420579Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:31.4421764Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:31.4423130Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:31.4424284Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:31.4425202Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:31.4425943Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:31.4426963Z W0507 20:31:31.436000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
[... identical identify_mutated_tensors warning traceback repeated 3 more times at W0507 20:31:31.533, 20:31:31.798, and 20:31:31.813; repetitions elided ...]
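The reference path that keeps failing above (ref_fn) computes silu(x0) * x1 in fp32 and then quantizes row-wise via triton_quantize_fp8_row, which cannot compile here either. For intuition only, a pure-PyTorch stand-in sketch; quantize_fp8_row_ref is a hypothetical name, the exact kernel semantics (including how scale_ub is applied) are assumptions, and dequantization follows the test's y_fp8.to(torch.float32) * y_scale[:, None] convention:

# Hypothetical pure-PyTorch stand-in for row-wise fp8 quantization; the real
# _kernel_quantize_fp8_row's exact semantics are assumed, not copied.
from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Assumption: scale_ub caps the per-row maximum before scaling.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    # One scale per row; avoid dividing by zero for all-zero rows.
    scale = torch.where(row_max > 0, row_max / FP8_E4M3_MAX, torch.ones_like(row_max))
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]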
2025-05-07T20:31:32.1375374Z self = 
2025-05-07T20:31:32.1376318Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:31:32.1376836Z 
2025-05-07T20:31:32.1376989Z     @given(
2025-05-07T20:31:32.1377422Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:32.1378002Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:32.1378555Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:32.1379171Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:32.1379775Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:32.1380331Z     )
2025-05-07T20:31:32.1380995Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:32.1381532Z     def test_silu_mul_quant(
2025-05-07T20:31:32.1381775Z         self,
2025-05-07T20:31:32.1381987Z         T: int,
2025-05-07T20:31:32.1382196Z         D: int,
2025-05-07T20:31:32.1382427Z         scale_ub: Optional[float],
2025-05-07T20:31:32.1382699Z         contiguous: bool,
2025-05-07T20:31:32.1382942Z         compiled: bool,
2025-05-07T20:31:32.1383175Z     ) -> None:
2025-05-07T20:31:32.1383390Z         torch.manual_seed(2025)
2025-05-07T20:31:32.1383644Z 
2025-05-07T20:31:32.1383923Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:32.1384264Z 
2025-05-07T20:31:32.1384462Z         x_sign = torch.sign(x)
2025-05-07T20:31:32.1384764Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:32.1385074Z         x = x_sign * x_clamp
2025-05-07T20:31:32.1385326Z         x0 = x[:, :D]
2025-05-07T20:31:32.1385540Z         x1 = x[:, D:]
2025-05-07T20:31:32.1385908Z 
2025-05-07T20:31:32.1386104Z         if contiguous:
2025-05-07T20:31:32.1386333Z             x0 = x0.contiguous()
2025-05-07T20:31:32.1386594Z             x1 = x1.contiguous()
2025-05-07T20:31:32.1386839Z 
2025-05-07T20:31:32.1387026Z         if scale_ub is not None:
2025-05-07T20:31:32.1387300Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:32.1387754Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:32.1388188Z             )
2025-05-07T20:31:32.1388459Z         else:
2025-05-07T20:31:32.1388747Z             scale_ub_tensor = None
2025-05-07T20:31:32.1389190Z 
2025-05-07T20:31:32.1389500Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1390086Z             op = silu_mul_quant
2025-05-07T20:31:32.1390382Z             if compiled:
2025-05-07T20:31:32.1390635Z                 op = torch.compile(op)
2025-05-07T20:31:32.1390934Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1391218Z 
2025-05-07T20:31:32.1391414Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:32.1391589Z 
2025-05-07T20:31:32.1391691Z moe/activation_test.py:117: 
2025-05-07T20:31:32.1391990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1392318Z moe/activation_test.py:115: in fn
2025-05-07T20:31:32.1392606Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1393302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:32.1393982Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:32.1394520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:32.1395211Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:32.1395871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:32.1396442Z     kernel = self.compile(
2025-05-07T20:31:32.1396991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:32.1397643Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:32.1398049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1398277Z 
2025-05-07T20:31:32.1398493Z self = 
2025-05-07T20:31:32.1399576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:32.1400938Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a995ade0>}
2025-05-07T20:31:32.1402286Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:32.1403315Z context = 
2025-05-07T20:31:32.1403603Z 
2025-05-07T20:31:32.1403780Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:32.1404295Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:32.1404767Z                            module_map=module_map)
2025-05-07T20:31:32.1405136Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:32.1405499Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:32.1405762Z E       ^
2025-05-07T20:31:32.1406231Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:32.1406682Z 
2025-05-07T20:31:32.1407199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:32.1407710Z 
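The sampled contiguous parameter decides whether the x0/x1 halves stay strided views of x or are materialized, which is why both layouts are tested. A small illustration, not from the test suite and needing no GPU:

# Why contiguous=False exercises strided inputs: column slices of a [T, 2*D]
# tensor keep the parent's row stride of 2*D.
import torch

T, D = 4, 8
x = torch.randn([T, 2 * D])
x0, x1 = x[:, :D], x[:, D:]
print(x0.is_contiguous(), x0.stride())   # False (16, 1): row stride is 2*D
print(x0.contiguous().is_contiguous())   # True: .contiguous() copies to a dense buffer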
2025-05-07T20:31:32.1407828Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:32.1408246Z     self=,
2025-05-07T20:31:32.1408659Z     T=2048,
2025-05-07T20:31:32.1408865Z     D=5120,
2025-05-07T20:31:32.1409060Z     scale_ub=1200.0,
2025-05-07T20:31:32.1409292Z     contiguous=True,
2025-05-07T20:31:32.1409518Z     compiled=True,
2025-05-07T20:31:32.1409722Z )
2025-05-07T20:31:32.1410094Z self = 
2025-05-07T20:31:32.1410803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:31:32.1411140Z 
2025-05-07T20:31:32.1411238Z     @given(
2025-05-07T20:31:32.1411533Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:32.1411937Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:32.1412327Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:32.1412670Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:32.1413003Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:32.1413293Z     )
2025-05-07T20:31:32.1413642Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:32.1414085Z     def test_silu_mul_quant(
2025-05-07T20:31:32.1414330Z         self,
2025-05-07T20:31:32.1414523Z         T: int,
2025-05-07T20:31:32.1414725Z         D: int,
2025-05-07T20:31:32.1414949Z         scale_ub: Optional[float],
2025-05-07T20:31:32.1415217Z         contiguous: bool,
2025-05-07T20:31:32.1415468Z         compiled: bool,
2025-05-07T20:31:32.1415694Z     ) -> None:
2025-05-07T20:31:32.1415911Z         torch.manual_seed(2025)
2025-05-07T20:31:32.1416157Z 
2025-05-07T20:31:32.1416432Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:32.1416775Z 
2025-05-07T20:31:32.1416981Z         x_sign = torch.sign(x)
2025-05-07T20:31:32.1417282Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:32.1417595Z         x = x_sign * x_clamp
2025-05-07T20:31:32.1417834Z         x0 = x[:, :D]
2025-05-07T20:31:32.1418055Z         x1 = x[:, D:]
2025-05-07T20:31:32.1418269Z 
2025-05-07T20:31:32.1418456Z         if contiguous:
2025-05-07T20:31:32.1418692Z             x0 = x0.contiguous()
2025-05-07T20:31:32.1418960Z             x1 = x1.contiguous()
2025-05-07T20:31:32.1419202Z 
2025-05-07T20:31:32.1419404Z         if scale_ub is not None:
2025-05-07T20:31:32.1419687Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:32.1420028Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:32.1420353Z             )
2025-05-07T20:31:32.1420558Z         else:
2025-05-07T20:31:32.1420790Z             scale_ub_tensor = None
2025-05-07T20:31:32.1421077Z 
2025-05-07T20:31:32.1421326Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1421642Z             op = silu_mul_quant
2025-05-07T20:31:32.1421906Z             if compiled:
2025-05-07T20:31:32.1422166Z                 op = torch.compile(op)
2025-05-07T20:31:32.1422468Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:32.1422740Z 
2025-05-07T20:31:32.1422937Z         y_fp8, y_scale = fn()
2025-05-07T20:31:32.1423229Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:32.1423516Z 
2025-05-07T20:31:32.1423762Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:32.1424096Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:32.1424387Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:32.1424709Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:32.1425074Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:32.1425384Z 
2025-05-07T20:31:32.1425681Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:32.1425883Z 
2025-05-07T20:31:32.1425984Z moe/activation_test.py:126: 
2025-05-07T20:31:32.1426290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:32.1426623Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:32.1426950Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:32.1427734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:32.1428905Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:32.1429515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:32.1430346Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:32.1431029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:32.1431746Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:32.1432498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:32.1433245Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:32.1433974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:32.1434602Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:32.1435202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:32.1435729Z     fn()
2025-05-07T20:31:32.1436234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:32.1436820Z     self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:32.1437832Z kernel = self.compile( 2025-05-07T20:31:32.1438374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:32.1439031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.1439437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:32.1439663Z 2025-05-07T20:31:32.1439874Z self = 2025-05-07T20:31:32.1440995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:32.1442377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9ace700>} 2025-05-07T20:31:32.1443717Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:32.1444740Z context = 2025-05-07T20:31:32.1445025Z 2025-05-07T20:31:32.1445192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:32.1445707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.1446173Z module_map=module_map) 2025-05-07T20:31:32.1446543Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.1446896Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:32.1447164Z E ^ 2025-05-07T20:31:32.1447772Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.1448223Z 2025-05-07T20:31:32.1448642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:32.1449153Z 2025-05-07T20:31:32.1449256Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:32.1449665Z self=, 2025-05-07T20:31:32.1450123Z T=16384, 2025-05-07T20:31:32.1450365Z D=7168, 2025-05-07T20:31:32.1450609Z scale_ub=1200.0, 2025-05-07T20:31:32.1450889Z contiguous=False, 2025-05-07T20:31:32.1451165Z compiled=False, 2025-05-07T20:31:32.1451419Z ) 2025-05-07T20:31:32.3947194Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.3948281Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.3949671Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.3951084Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.3952064Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3953372Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:32.3954747Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.3955725Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3956947Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.3958326Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.3959396Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3960674Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.3961926Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.3963147Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.3964356Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.3965453Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.3966487Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.3967520Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.3968318Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.3969538Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.3970957Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.3972088Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.3973138Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.3974317Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.3975688Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.3976755Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.3977670Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.3978413Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.3979438Z W0507 20:31:32.392000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.6128411Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.6129504Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.6130862Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.6132284Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.6133258Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6134564Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.6136100Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.6137093Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6138327Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.6139700Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.6140888Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6142168Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.6143422Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.6144647Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.6145865Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.6146701Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.6147730Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.6148757Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.6149609Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.6150816Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.6152110Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.6153239Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.6154293Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.6155481Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.6156857Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.6157938Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.6158945Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.6159699Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.6160724Z W0507 20:31:32.610000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.8540036Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.8542132Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.8544803Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.8547629Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.8549670Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8551333Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.8552716Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.8553703Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8554927Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.8556295Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.8557353Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8558640Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.8559886Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.8561131Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.8562347Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.8563192Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8564371Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.8565407Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.8566211Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.8567437Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.8568739Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.8569961Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.8571013Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.8572188Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.8573553Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.8574617Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.8575538Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.8583432Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.8584492Z W0507 20:31:32.851000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:32.8681171Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:32.8682229Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:32.8683585Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:32.8685006Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:32.8685973Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8687275Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:32.8688658Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:32.8689801Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8691045Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:32.8692413Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:32.8693476Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8694874Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:32.8696121Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:32.8697358Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:32.8698570Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:32.8699404Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:32.8700435Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:32.8701466Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:32.8702267Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:32.8703473Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:32.8704756Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:32.8705883Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:32.8706935Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:32.8708128Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:32.8709540Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:32.8710607Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:32.8711574Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:32.8712408Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:32.8713425Z W0507 20:31:32.866000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:33.6605155Z self = 2025-05-07T20:31:33.6605905Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:33.6606305Z 2025-05-07T20:31:33.6606411Z @given( 2025-05-07T20:31:33.6606716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.6607134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.6607523Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.6608297Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.6608631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.6608918Z ) 2025-05-07T20:31:33.6609280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.6609729Z def test_silu_mul_quant( 2025-05-07T20:31:33.6609980Z self, 2025-05-07T20:31:33.6610185Z T: int, 2025-05-07T20:31:33.6610393Z D: int, 2025-05-07T20:31:33.6610628Z scale_ub: Optional[float], 2025-05-07T20:31:33.6610897Z contiguous: bool, 2025-05-07T20:31:33.6611148Z compiled: bool, 2025-05-07T20:31:33.6611382Z ) -> None: 2025-05-07T20:31:33.6611604Z torch.manual_seed(2025) 2025-05-07T20:31:33.6611854Z 2025-05-07T20:31:33.6612138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.6612485Z 2025-05-07T20:31:33.6612691Z x_sign = torch.sign(x) 2025-05-07T20:31:33.6612999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.6613312Z x = x_sign * x_clamp 2025-05-07T20:31:33.6613565Z x0 = x[:, :D] 2025-05-07T20:31:33.6613791Z x1 = x[:, D:] 2025-05-07T20:31:33.6614001Z 2025-05-07T20:31:33.6614198Z if contiguous: 2025-05-07T20:31:33.6614439Z x0 = x0.contiguous() 2025-05-07T20:31:33.6614698Z x1 = x1.contiguous() 2025-05-07T20:31:33.6614948Z 2025-05-07T20:31:33.6615147Z if scale_ub is not None: 2025-05-07T20:31:33.6615420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.6615760Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.6616074Z ) 2025-05-07T20:31:33.6616274Z else: 2025-05-07T20:31:33.6616486Z scale_ub_tensor = None 2025-05-07T20:31:33.6616742Z 2025-05-07T20:31:33.6616983Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6617305Z op = silu_mul_quant 2025-05-07T20:31:33.6617564Z if compiled: 2025-05-07T20:31:33.6617817Z op = torch.compile(op) 2025-05-07T20:31:33.6618115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6618397Z 2025-05-07T20:31:33.6618607Z > y_fp8, y_scale = fn() 2025-05-07T20:31:33.6618772Z 2025-05-07T20:31:33.6618883Z moe/activation_test.py:117: 2025-05-07T20:31:33.6619182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6619519Z moe/activation_test.py:115: in fn 2025-05-07T20:31:33.6619802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6620489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:33.6621180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:33.6621719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:33.6622405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.6623068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:33.6623761Z kernel = self.compile( 2025-05-07T20:31:33.6624307Z
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.6624962Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.6625367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6625597Z 2025-05-07T20:31:33.6625811Z self = 2025-05-07T20:31:33.6626890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.6628536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aabc1760>} 2025-05-07T20:31:33.6629948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.6631014Z context = 2025-05-07T20:31:33.6631313Z 2025-05-07T20:31:33.6631492Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.6632006Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.6632478Z module_map=module_map) 2025-05-07T20:31:33.6632861Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.6633228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:33.6633489Z E ^ 2025-05-07T20:31:33.6633964Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.6634411Z 2025-05-07T20:31:33.6634844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.6635355Z 2025-05-07T20:31:33.6635469Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.6635882Z self=, 2025-05-07T20:31:33.6636288Z T=1, 2025-05-07T20:31:33.6636483Z D=7168, 2025-05-07T20:31:33.6636680Z scale_ub=None, 2025-05-07T20:31:33.6636904Z contiguous=True, 2025-05-07T20:31:33.6637136Z compiled=True, 2025-05-07T20:31:33.6637348Z ) 2025-05-07T20:31:33.6637678Z self = 2025-05-07T20:31:33.6638177Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:33.6638437Z 2025-05-07T20:31:33.6638525Z @given( 2025-05-07T20:31:33.6638757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:33.6639086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:33.6639401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:33.6639728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:33.6640067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:33.6640357Z ) 2025-05-07T20:31:33.6640703Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:33.6641148Z def test_silu_mul_quant( 2025-05-07T20:31:33.6641397Z self, 2025-05-07T20:31:33.6641595Z T: int, 2025-05-07T20:31:33.6641793Z D: int, 2025-05-07T20:31:33.6642015Z scale_ub: Optional[float], 2025-05-07T20:31:33.6642294Z contiguous: bool, 2025-05-07T20:31:33.6642544Z compiled: bool, 2025-05-07T20:31:33.6642769Z ) -> None: 2025-05-07T20:31:33.6642991Z torch.manual_seed(2025) 2025-05-07T20:31:33.6643230Z 2025-05-07T20:31:33.6643510Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:33.6643978Z 2025-05-07T20:31:33.6644175Z x_sign = torch.sign(x) 2025-05-07T20:31:33.6644474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:33.6644785Z x = x_sign * x_clamp 2025-05-07T20:31:33.6645025Z x0 = x[:, :D] 2025-05-07T20:31:33.6645244Z x1 = 
x[:, D:] 2025-05-07T20:31:33.6645454Z 2025-05-07T20:31:33.6645638Z if contiguous: 2025-05-07T20:31:33.6645879Z x0 = x0.contiguous() 2025-05-07T20:31:33.6646142Z x1 = x1.contiguous() 2025-05-07T20:31:33.6646374Z 2025-05-07T20:31:33.6646570Z if scale_ub is not None: 2025-05-07T20:31:33.6646848Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:33.6647187Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:33.6647611Z ) 2025-05-07T20:31:33.6647807Z else: 2025-05-07T20:31:33.6648025Z scale_ub_tensor = None 2025-05-07T20:31:33.6648274Z 2025-05-07T20:31:33.6648516Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6648832Z op = silu_mul_quant 2025-05-07T20:31:33.6649087Z if compiled: 2025-05-07T20:31:33.6649337Z op = torch.compile(op) 2025-05-07T20:31:33.6649636Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:33.6649914Z 2025-05-07T20:31:33.6650111Z y_fp8, y_scale = fn() 2025-05-07T20:31:33.6650400Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:33.6650690Z 2025-05-07T20:31:33.6650973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:33.6651332Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:33.6651631Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:33.6651947Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:33.6652315Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.6652633Z 2025-05-07T20:31:33.6652837Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:33.6653046Z 2025-05-07T20:31:33.6653148Z moe/activation_test.py:126: 2025-05-07T20:31:33.6653454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6653792Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:33.6654127Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:33.6654914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:33.6655673Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:33.6656217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:31:33.6656908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:33.6657605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:33.6658340Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.6659096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp> 2025-05-07T20:31:33.6659865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:33.6660598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:33.6661288Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:33.6661891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:33.6662417Z fn() 2025-05-07T20:31:33.6662925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:33.6663504Z self.fn.run( 2025-05-07T20:31:33.6664061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:33.6664597Z kernel = self.compile( 2025-05-07T20:31:33.6665137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:33.6665793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:33.6666200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:33.6666427Z 2025-05-07T20:31:33.6666644Z self = 2025-05-07T20:31:33.6667716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:33.6669228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aa04cd60>} 2025-05-07T20:31:33.6670577Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:33.6671600Z context = 2025-05-07T20:31:33.6671887Z 2025-05-07T20:31:33.6672062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:33.6672574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:33.6673043Z module_map=module_map) 2025-05-07T20:31:33.6673415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:33.6673765Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:33.6674034Z E ^ 2025-05-07T20:31:33.6674511Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:33.6674955Z 2025-05-07T20:31:33.6675380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:33.6675887Z 2025-05-07T20:31:33.6675991Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:33.6676404Z self=, 2025-05-07T20:31:33.6676811Z T=4096, 2025-05-07T20:31:33.6676998Z D=5120, 2025-05-07T20:31:33.6677196Z scale_ub=None, 2025-05-07T20:31:33.6677419Z contiguous=False, 2025-05-07T20:31:33.6677641Z compiled=False, 2025-05-07T20:31:33.6677856Z ) 2025-05-07T20:31:34.0225762Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.0228051Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.0230856Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.0232514Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.0233483Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0234962Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.0236341Z W0507 20:31:34.020000 237772 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.0237325Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0238545Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.0239908Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.0241092Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0242367Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.0243610Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.0244831Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.0246038Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.0246871Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.0247895Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.0248915Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.0249704Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.0250914Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.0252260Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.0253382Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.0254427Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.0255601Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.0256968Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.0258115Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.0259030Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.0259775Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.0260787Z W0507 20:31:34.020000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.2744085Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.2745511Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.2746846Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.2748305Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.2749369Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2750666Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.2752093Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.2753073Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2754300Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.2755825Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.2756890Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2758299Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.2759550Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.2760768Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.2761979Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.2762813Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.2763983Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.2765003Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.2765804Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.2767008Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.2768368Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.2769502Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.2770541Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.2771721Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.2773073Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.2774137Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.2775053Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.2775796Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.2776818Z W0507 20:31:34.272000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:34.6422647Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:34.6423714Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:34.6425057Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:34.6426467Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:34.6427443Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6428888Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:34.6430312Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:34.6431457Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6432685Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:34.6434050Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:34.6435122Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6436544Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:34.6437792Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:34.6439017Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:34.6440221Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:34.6441063Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:34.6442095Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:34.6443117Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:34.6443907Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:31:34.6445121Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:34.6446400Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:34.6447531Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:34.6448578Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:34.6449745Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:34.6451105Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:34.6452163Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:34.6453083Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:34.6453909Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:34.6454927Z W0507 20:31:34.640000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:34.6565532Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:34.6567760Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:31:34.6570706Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:34.6572488Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:34.6573457Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6574759Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:34.6576138Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:34.6577126Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6578347Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:34.6579704Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:34.6580766Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6582057Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:34.6583301Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:31:34.6584522Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:34.6585730Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:31:34.6586556Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:34.6587588Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:34.6588720Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:31:34.6589586Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ^^^^^^^^^^^^^
2025-05-07T20:31:34.6590790Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:34.6592121Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:34.6593319Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:34.6594372Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:31:34.6595556Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:34.6596905Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:34.6597967Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:34.6598881Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:34.6599632Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:31:34.6600647Z W0507 20:31:34.654000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4001400Z self =
2025-05-07T20:31:36.4001979Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:31:36.4002363Z
2025-05-07T20:31:36.4002486Z     @given(
2025-05-07T20:31:36.4002811Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:36.4003251Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:36.4003654Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:36.4004017Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:36.4004359Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:36.4004646Z     )
2025-05-07T20:31:36.4005009Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:36.4005455Z     def test_silu_mul_quant(
2025-05-07T20:31:36.4005712Z         self,
2025-05-07T20:31:36.4005929Z         T: int,
2025-05-07T20:31:36.4006124Z         D: int,
2025-05-07T20:31:36.4006346Z         scale_ub: Optional[float],
2025-05-07T20:31:36.4006627Z         contiguous: bool,
2025-05-07T20:31:36.4006865Z         compiled: bool,
2025-05-07T20:31:36.4007103Z     ) -> None:
2025-05-07T20:31:36.4007330Z         torch.manual_seed(2025)
2025-05-07T20:31:36.4007582Z
2025-05-07T20:31:36.4007869Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:36.4016631Z
2025-05-07T20:31:36.4016869Z         x_sign = torch.sign(x)
2025-05-07T20:31:36.4017179Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:36.4017507Z         x = x_sign * x_clamp
2025-05-07T20:31:36.4017786Z         x0 = x[:, :D]
2025-05-07T20:31:36.4018011Z         x1 = x[:, D:]
2025-05-07T20:31:36.4018230Z
2025-05-07T20:31:36.4018750Z         if contiguous:
2025-05-07T20:31:36.4018996Z             x0 = x0.contiguous()
2025-05-07T20:31:36.4019259Z             x1 = x1.contiguous()
2025-05-07T20:31:36.4019502Z
2025-05-07T20:31:36.4019703Z         if scale_ub is not None:
2025-05-07T20:31:36.4019981Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:36.4020323Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:36.4020638Z             )
2025-05-07T20:31:36.4020838Z         else:
2025-05-07T20:31:36.4021049Z             scale_ub_tensor = None
2025-05-07T20:31:36.4021308Z
2025-05-07T20:31:36.4021552Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.4021876Z             op = silu_mul_quant
2025-05-07T20:31:36.4022337Z             if compiled:
2025-05-07T20:31:36.4022591Z                 op = torch.compile(op)
2025-05-07T20:31:36.4022891Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.4023161Z
2025-05-07T20:31:36.4023366Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:36.4023531Z
2025-05-07T20:31:36.4023646Z moe/activation_test.py:117:
2025-05-07T20:31:36.4023937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4024274Z moe/activation_test.py:115: in fn
2025-05-07T20:31:36.4024559Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:36.4025246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:36.4025947Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:36.4026488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:36.4027186Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.4027845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.4028800Z     kernel = self.compile(
2025-05-07T20:31:36.4029416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.4030077Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4030469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4030702Z
2025-05-07T20:31:36.4030908Z self =
2025-05-07T20:31:36.4031985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.4033453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d402c0>}
2025-05-07T20:31:36.4034804Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.4035822Z context =
2025-05-07T20:31:36.4036112Z
2025-05-07T20:31:36.4036278Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.4036794Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4037261Z                            module_map=module_map)
2025-05-07T20:31:36.4037623Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4037982Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4038244Z E       ^
2025-05-07T20:31:36.4038706Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4039165Z
2025-05-07T20:31:36.4039756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
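Every Hypothesis example below dies in the same fp8e4nv compile, so the failure is a property of the runner's GPU, not of the sampled shapes. On pre-SM-8.9 machines a guard along the following lines would skip the test up front (a sketch only; requires_fp8e4nv is a hypothetical marker, not an existing FBGEMM decorator):

    import pytest
    import torch

    # Hypothetical gate for tests that JIT-compile fp8e4nv (FP8 E4M3) Triton kernels.
    requires_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv needs SM 8.9+ (L4/L40S/H100); A10G is SM 8.6",
    )

    @requires_fp8e4nv
    def test_fp8_smoke() -> None:
        # Placeholder body; the real test would exercise the fp8 kernels.
        assert torch.cuda.get_device_capability() >= (8, 9)

Applied above the existing @given/@settings decorators, this turns the hard CompilationError into a skip on unsupported hardware.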
2025-05-07T20:31:36.4040381Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4040796Z     self=,
2025-05-07T20:31:36.4041193Z     T=4096,
2025-05-07T20:31:36.4041388Z     D=7168,
2025-05-07T20:31:36.4041590Z     scale_ub=None,
2025-05-07T20:31:36.4041813Z     contiguous=False,
2025-05-07T20:31:36.4042047Z     compiled=False,
2025-05-07T20:31:36.4042262Z )
2025-05-07T20:31:36.4068851Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4069324Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.4069585Z E       ^
2025-05-07T20:31:36.4070044Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4070930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.4071555Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4071968Z     self=,
2025-05-07T20:31:36.4072365Z     T=128,
2025-05-07T20:31:36.4072566Z     D=7168,
2025-05-07T20:31:36.4072768Z     scale_ub=None,
2025-05-07T20:31:36.4072981Z     contiguous=False,
2025-05-07T20:31:36.4073207Z     compiled=True,
2025-05-07T20:31:36.4073410Z )
2025-05-07T20:31:36.4546679Z         y_fp8, y_scale = fn()
2025-05-07T20:31:36.4546972Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:36.4547268Z
2025-05-07T20:31:36.4547505Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:36.4547844Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:36.4548144Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:36.4548458Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:36.4548820Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.4549256Z
2025-05-07T20:31:36.4549462Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:36.4549766Z moe/activation_test.py:126:
2025-05-07T20:31:36.4550066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4550407Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:36.4550733Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:36.4551526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:36.4552288Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:36.4552837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:36.4553520Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:36.4554212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:36.4554936Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.4555685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:36.4556431Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:36.4557156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:36.4557801Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:36.4558492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:36.4559019Z     fn()
2025-05-07T20:31:36.4559529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:36.4560106Z     self.fn.run(
2025-05-07T20:31:36.4560573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:36.4561104Z     kernel = self.compile(
2025-05-07T20:31:36.4561650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:36.4562357Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:36.4562834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:36.4563065Z
2025-05-07T20:31:36.4563281Z self =
2025-05-07T20:31:36.4564364Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:36.4565754Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d437e0>}
2025-05-07T20:31:36.4567092Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:36.4568111Z context =
2025-05-07T20:31:36.4568402Z
2025-05-07T20:31:36.4568578Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:36.4569095Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:36.4569565Z                            module_map=module_map)
2025-05-07T20:31:36.4569973Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.4570441Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:36.4570708Z E       ^
2025-05-07T20:31:36.4571181Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.4571628Z
2025-05-07T20:31:36.4572087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
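Note that with compiled=True the test survives fn() but still dies in ref_fn(): the reference path calls triton_quantize_fp8_row, which JIT-compiles its own fp8 Triton kernel (_kernel_quantize_fp8_row in fp8_gemm.py), so it hits the same architecture limit. A capability-independent reference would have to avoid Triton entirely, e.g. a pure-PyTorch row-wise quantizer like the following (a sketch; the scale semantics are approximated from how the test consumes the outputs, and it assumes torch.float8_e4m3fn conversions are available in this build):

    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for E4M3

    def quantize_fp8_row_torch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row scale chosen so that dequantization is y_fp8.float() * scale[:, None],
        # mirroring the test's use of triton_quantize_fp8_row's outputs.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(FP8_DTYPE)
        return y_fp8, scale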
2025-05-07T20:31:36.4572733Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.4573143Z     self=,
2025-05-07T20:31:36.4573555Z     T=128,
2025-05-07T20:31:36.4573753Z     D=7168,
2025-05-07T20:31:36.4573944Z     scale_ub=None,
2025-05-07T20:31:36.4574166Z     contiguous=False,
2025-05-07T20:31:36.4574408Z     compiled=False,
2025-05-07T20:31:36.4574617Z )
2025-05-07T20:31:36.6125496Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6125853Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6126109Z E       ^
2025-05-07T20:31:36.6126584Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6127533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6128555Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6129031Z     self=,
2025-05-07T20:31:36.6129502Z     T=4096,
2025-05-07T20:31:36.6129708Z     D=5120,
2025-05-07T20:31:36.6129915Z     scale_ub=1200.0,
2025-05-07T20:31:36.6130157Z     contiguous=True,
2025-05-07T20:31:36.6130400Z     compiled=False,
2025-05-07T20:31:36.6130618Z )
2025-05-07T20:31:36.6167981Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:36.6168351Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:36.6168623Z E       ^
2025-05-07T20:31:36.6169106Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:36.6169998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:36.6170629Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:36.6171047Z     self=,
2025-05-07T20:31:36.6171461Z     T=1,
2025-05-07T20:31:36.6171664Z     D=5120,
2025-05-07T20:31:36.6171868Z     scale_ub=None,
2025-05-07T20:31:36.6172086Z     contiguous=True,
2025-05-07T20:31:36.6172341Z     compiled=True,
2025-05-07T20:31:36.6172583Z )
2025-05-07T20:31:37.5167943Z self =
2025-05-07T20:31:37.5168656Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:37.5183709Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:37.5184016Z moe/activation_test.py:126:
2025-05-07T20:31:37.5204307Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:37.5204662Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:37.5204930Z E       ^
2025-05-07T20:31:37.5205389Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.5206263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:37.5205846Z 
2025-05-07T20:31:37.5206263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:37.5206776Z 
2025-05-07T20:31:37.5206880Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:37.5207303Z self=,
2025-05-07T20:31:37.5207700Z T=2048,
2025-05-07T20:31:37.5207894Z D=5120,
2025-05-07T20:31:37.5208093Z scale_ub=None,
2025-05-07T20:31:37.5208312Z contiguous=True,
2025-05-07T20:31:37.5208544Z compiled=True,
2025-05-07T20:31:37.5208757Z )
2025-05-07T20:31:38.2131075Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:38.2132348Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last):
2025-05-07T20:31:38.2133681Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:38.2135090Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:38.2136078Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2137370Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:38.2138748Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:38.2139728Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2140959Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:38.2142332Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:38.2143438Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2144714Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:38.2145966Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     generator.visit(fn.parse())
2025-05-07T20:31:38.2147369Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:38.2148710Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ret = super().visit(node)
2025-05-07T20:31:38.2158462Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:38.2159562Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:38.2160607Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     return visitor(node)
2025-05-07T20:31:38.2161602Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ^^^^^^^^^^^^^
2025-05-07T20:31:38.2162894Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:38.2164201Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:38.2165330Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:38.2166396Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     self.visit(item)
2025-05-07T20:31:38.2167601Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:38.2168984Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:38.2170061Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:38.2170978Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:38.2171737Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^
2025-05-07T20:31:38.2172772Z W0507 20:31:38.211000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:38.5756753Z self = 2025-05-07T20:31:38.5757554Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:38.5757924Z 2025-05-07T20:31:38.5758016Z @given( 2025-05-07T20:31:38.5758257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:38.5758582Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:38.5758898Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:38.5759227Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:38.5759565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:38.5759861Z ) 2025-05-07T20:31:38.5760212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:38.5760671Z def test_silu_mul_quant( 2025-05-07T20:31:38.5760927Z self, 2025-05-07T20:31:38.5761136Z T: int, 2025-05-07T20:31:38.5761341Z D: int, 2025-05-07T20:31:38.5761571Z scale_ub: Optional[float], 2025-05-07T20:31:38.5761847Z contiguous: bool, 2025-05-07T20:31:38.5762456Z compiled: bool, 2025-05-07T20:31:38.5762723Z ) -> None: 2025-05-07T20:31:38.5762972Z torch.manual_seed(2025) 2025-05-07T20:31:38.5763215Z 2025-05-07T20:31:38.5763495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:38.5763842Z 2025-05-07T20:31:38.5764041Z x_sign = torch.sign(x) 2025-05-07T20:31:38.5764339Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:38.5764655Z x = x_sign * x_clamp 2025-05-07T20:31:38.5764896Z x0 = x[:, :D] 2025-05-07T20:31:38.5765120Z x1 = x[:, D:] 2025-05-07T20:31:38.5765337Z 2025-05-07T20:31:38.5765527Z if contiguous: 2025-05-07T20:31:38.5765927Z x0 = x0.contiguous() 2025-05-07T20:31:38.5766196Z x1 = x1.contiguous() 2025-05-07T20:31:38.5766432Z 2025-05-07T20:31:38.5766634Z if scale_ub is not None: 2025-05-07T20:31:38.5766915Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:38.5767260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:38.5767569Z ) 2025-05-07T20:31:38.5767774Z else: 2025-05-07T20:31:38.5767997Z scale_ub_tensor = None 2025-05-07T20:31:38.5768249Z 2025-05-07T20:31:38.5768484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.5768804Z op = silu_mul_quant 2025-05-07T20:31:38.5769053Z if compiled: 2025-05-07T20:31:38.5769309Z op = torch.compile(op) 2025-05-07T20:31:38.5769615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:38.5769891Z 2025-05-07T20:31:38.5770096Z y_fp8, y_scale = fn() 2025-05-07T20:31:38.5770388Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:38.5770682Z 2025-05-07T20:31:38.5770926Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:38.5771268Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:38.5771572Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:38.5771886Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:38.5772253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:38.5772569Z 2025-05-07T20:31:38.5772774Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:38.5772979Z 2025-05-07T20:31:38.5773093Z moe/activation_test.py:126: 2025-05-07T20:31:38.5773435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.5773770Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:38.5774105Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:38.5774893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:38.5775655Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:38.5776203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:38.5776887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:38.5777577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:38.5778302Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:38.5779048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:38.5779799Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:38.5780525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:38.5781161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:38.5781868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:38.5782393Z fn() 2025-05-07T20:31:38.5782904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:38.5783485Z self.fn.run( 2025-05-07T20:31:38.5784008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:38.5784542Z kernel = self.compile( 2025-05-07T20:31:38.5785079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:38.5785734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:38.5786136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:38.5786446Z 2025-05-07T20:31:38.5786657Z self = 2025-05-07T20:31:38.5787744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:38.5789221Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9986840>} 2025-05-07T20:31:38.5790563Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:38.5791584Z context = 2025-05-07T20:31:38.5791874Z 2025-05-07T20:31:38.5792049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:38.5792562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:38.5793035Z module_map=module_map) 2025-05-07T20:31:38.5793444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:38.5793816Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:38.5794098Z E ^ 2025-05-07T20:31:38.5794569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:38.5795017Z 
2025-05-07T20:31:38.5795444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:38.5795951Z 
2025-05-07T20:31:38.5796057Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:38.5796479Z self=,
2025-05-07T20:31:38.5796894Z T=128,
2025-05-07T20:31:38.5797084Z D=5120,
2025-05-07T20:31:38.5797290Z scale_ub=None,
2025-05-07T20:31:38.5797516Z contiguous=True,
2025-05-07T20:31:38.5797742Z compiled=True,
2025-05-07T20:31:38.5797968Z )
2025-05-07T20:31:39.2861630Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:39.2862922Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last):
2025-05-07T20:31:39.2864269Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:31:39.2865704Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:31:39.2866695Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2868008Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:31:39.2869507Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:39.2870496Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2871725Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:39.2873095Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:39.2874211Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2875491Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:31:39.2876741Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     generator.visit(fn.parse())
2025-05-07T20:31:39.2877960Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:31:39.2879175Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ret = super().visit(node)
2025-05-07T20:31:39.2880123Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:39.2881158Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit
2025-05-07T20:31:39.2882180Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     return visitor(node)
2025-05-07T20:31:39.2882980Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ^^^^^^^^^^^^^
2025-05-07T20:31:39.2884241Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:31:39.2885596Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:31:39.2886715Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit
2025-05-07T20:31:39.2887759Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     self.visit(item)
2025-05-07T20:31:39.2888941Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:31:39.2890298Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:31:39.2891366Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:39.2892288Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:39.2893040Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^
2025-05-07T20:31:39.2894065Z W0507 20:31:39.284000 237772 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:39.5197710Z self =
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
.../fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
.../triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
.../triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
.../triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
.../triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
.../triton/testing.py:117: in do_bench
    fn()
.../triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
.../triton/runtime/jit.py:623: in run
    kernel = self.compile(
.../triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
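Every failure in this run reduces to the same root cause: fp8e4nv is Triton's name for torch.float8_e4m3fn, and NVIDIA GPUs only expose that format natively from compute capability 8.9 (Ada/Hopper) onward. On an older part such as the sm_86 GPUs that g5 runners carry, Triton offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal sketch of the capability check implied by the error (the helper name is ours, not an FBGEMM or Triton API):

    import torch

    def fp8e4nv_supported(device: int = 0) -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability(device) >= (8, 9)

    # On an sm_86 GPU this returns False, matching the ValueError above:
    # only fp8e4b15 and fp8e5 are available there.
    print(fp8e4nv_supported())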
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[... the identify_mutated_tensors warning and _fbgemm_silu_mul_quant traceback above repeat verbatim four more times here (W0507 20:31:39.867 through 20:31:40.231, [1/7]), each ending in the same fp8e4nv ValueError; duplicates omitted ...]
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:31:40.500000 237772 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

self =
T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... same jit -> compile -> make_ir frames as above; this example fails in fn() itself, i.e. in FBGEMM's own kernel rather than in the reference quantizer ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
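Both the FBGEMM kernel and the reference path compute the same thing: y = SiLU(x0) * x1, followed by row-wise fp8 quantization. For readers without an fp8-capable GPU, here is a plain-PyTorch sketch of the row-wise quantization that triton_quantize_fp8_row performs, assuming the usual max-abs-per-row scaling; FBGEMM's exact epsilon and clamping behavior may differ, and quantize_fp8_row_reference is our name, not an FBGEMM API:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_reference(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the row scale
        scale = row_max / fp8_max                        # one scale per row
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(4, 8), torch.randn(4, 8)
    y = x0 * torch.sigmoid(x0) * x1                      # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = quantize_fp8_row_reference(y)
    y_round_trip = y_fp8.to(torch.float32) * y_scale[:, None]

The round-trip line mirrors how the test reconstructs y from (y_fp8, y_scale) before comparing against the reference.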
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the T = 128 listing above; omitted]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[... same triton_quantize_fp8_row -> autotuner -> compile -> make_ir frames as above ...]
at 0x7f6873915b20>} 2025-05-07T20:31:40.9217590Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:40.9218609Z context = 2025-05-07T20:31:40.9218979Z 2025-05-07T20:31:40.9219148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:40.9219674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:40.9220135Z module_map=module_map) 2025-05-07T20:31:40.9220511Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:40.9220872Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:40.9221149Z E ^ 2025-05-07T20:31:40.9221616Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:40.9222072Z 2025-05-07T20:31:40.9222490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:40.9223000Z 2025-05-07T20:31:40.9223118Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:40.9223541Z self=, 2025-05-07T20:31:40.9223946Z T=1, 2025-05-07T20:31:40.9224140Z D=5120, 2025-05-07T20:31:40.9224331Z scale_ub=None, 2025-05-07T20:31:40.9224549Z contiguous=True, 2025-05-07T20:31:40.9224782Z compiled=False, 2025-05-07T20:31:40.9224993Z ) 2025-05-07T20:31:41.0392311Z self = 2025-05-07T20:31:41.0393068Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:41.0393429Z 2025-05-07T20:31:41.0393549Z @given( 2025-05-07T20:31:41.0393918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0394249Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0394557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0394889Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0395230Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0395548Z ) 2025-05-07T20:31:41.0395898Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0396343Z def test_silu_mul_quant( 2025-05-07T20:31:41.0396591Z self, 2025-05-07T20:31:41.0396807Z T: int, 2025-05-07T20:31:41.0397006Z D: int, 2025-05-07T20:31:41.0397232Z scale_ub: Optional[float], 2025-05-07T20:31:41.0397510Z contiguous: bool, 2025-05-07T20:31:41.0397750Z compiled: bool, 2025-05-07T20:31:41.0397979Z ) -> None: 2025-05-07T20:31:41.0398203Z torch.manual_seed(2025) 2025-05-07T20:31:41.0398443Z 2025-05-07T20:31:41.0398720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0399070Z 2025-05-07T20:31:41.0399270Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0399565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0399910Z x = x_sign * x_clamp 2025-05-07T20:31:41.0400163Z x0 = x[:, :D] 2025-05-07T20:31:41.0400395Z x1 = x[:, D:] 2025-05-07T20:31:41.0400602Z 2025-05-07T20:31:41.0400800Z if contiguous: 2025-05-07T20:31:41.0401044Z x0 = x0.contiguous() 2025-05-07T20:31:41.0401300Z x1 = x1.contiguous() 2025-05-07T20:31:41.0401890Z 2025-05-07T20:31:41.0402099Z if scale_ub is not None: 2025-05-07T20:31:41.0402376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0402717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0403033Z ) 2025-05-07T20:31:41.0403229Z else: 2025-05-07T20:31:41.0403453Z scale_ub_tensor = None 2025-05-07T20:31:41.0403737Z 2025-05-07T20:31:41.0403990Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0404313Z op = silu_mul_quant 2025-05-07T20:31:41.0404574Z if compiled: 2025-05-07T20:31:41.0404829Z 
op = torch.compile(op) 2025-05-07T20:31:41.0405127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0405584Z 2025-05-07T20:31:41.0405784Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0405952Z 2025-05-07T20:31:41.0406054Z moe/activation_test.py:117: 2025-05-07T20:31:41.0406359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0406695Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0406974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0407669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.0408364Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.0408899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.0409574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.0410239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.0410776Z kernel = self.compile( 2025-05-07T20:31:41.0411310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.0411967Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.0412364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0412594Z 2025-05-07T20:31:41.0412810Z self = 2025-05-07T20:31:41.0413878Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.0415265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873372200>} 2025-05-07T20:31:41.0416612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.0417630Z context = 2025-05-07T20:31:41.0417914Z 2025-05-07T20:31:41.0418087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.0418603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.0419074Z module_map=module_map) 2025-05-07T20:31:41.0419442Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.0419792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.0420059Z E ^ 2025-05-07T20:31:41.0420527Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.0420977Z 2025-05-07T20:31:41.0421397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.0421905Z 2025-05-07T20:31:41.0422096Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.0422514Z self=, 2025-05-07T20:31:41.0422919Z T=128, 2025-05-07T20:31:41.0423108Z D=5120, 2025-05-07T20:31:41.0423310Z scale_ub=None, 2025-05-07T20:31:41.0423536Z contiguous=False, 2025-05-07T20:31:41.0423792Z compiled=True, 2025-05-07T20:31:41.0424027Z ) 2025-05-07T20:31:41.0424354Z self = 2025-05-07T20:31:41.0424847Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:41.0425111Z 2025-05-07T20:31:41.0425194Z @given( 2025-05-07T20:31:41.0425436Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.0425832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.0426141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.0426476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.0426817Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.0427100Z ) 2025-05-07T20:31:41.0427455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.0427907Z def test_silu_mul_quant( 2025-05-07T20:31:41.0428423Z self, 2025-05-07T20:31:41.0428627Z T: int, 2025-05-07T20:31:41.0428834Z D: int, 2025-05-07T20:31:41.0429064Z scale_ub: Optional[float], 2025-05-07T20:31:41.0429392Z contiguous: bool, 2025-05-07T20:31:41.0429646Z compiled: bool, 2025-05-07T20:31:41.0429877Z ) -> None: 2025-05-07T20:31:41.0430097Z torch.manual_seed(2025) 2025-05-07T20:31:41.0430342Z 2025-05-07T20:31:41.0430617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.0430962Z 2025-05-07T20:31:41.0431164Z x_sign = torch.sign(x) 2025-05-07T20:31:41.0431458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.0431769Z x = x_sign * x_clamp 2025-05-07T20:31:41.0432013Z x0 = x[:, :D] 2025-05-07T20:31:41.0432237Z x1 = x[:, D:] 2025-05-07T20:31:41.0432446Z 2025-05-07T20:31:41.0432640Z if contiguous: 2025-05-07T20:31:41.0432875Z x0 = x0.contiguous() 2025-05-07T20:31:41.0433140Z x1 = x1.contiguous() 2025-05-07T20:31:41.0433405Z 2025-05-07T20:31:41.0433712Z if scale_ub is not None: 2025-05-07T20:31:41.0434081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.0434413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.0434724Z ) 2025-05-07T20:31:41.0434921Z else: 2025-05-07T20:31:41.0435130Z scale_ub_tensor = None 2025-05-07T20:31:41.0435394Z 2025-05-07T20:31:41.0435635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.0435949Z op = silu_mul_quant 2025-05-07T20:31:41.0436211Z if compiled: 2025-05-07T20:31:41.0436470Z op = torch.compile(op) 2025-05-07T20:31:41.0436762Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0437044Z 2025-05-07T20:31:41.0437246Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.0437410Z 2025-05-07T20:31:41.0437512Z moe/activation_test.py:117: 2025-05-07T20:31:41.0437814Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0438159Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.0438448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.0439003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.0439570Z return fn(*args, **kwargs) 
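Every Hypothesis example in this run dies the same way, in the frames that follow: fbgemm_gpu's silu_mul_quant launches a Triton kernel whose output dtype is fp8e4nv (Triton's name for the float8_e4m3fn encoding), and Triton only emits that encoding on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts, for example an A10G at SM 8.6 or an A100 at SM 8.0, it offers only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal preflight sketch; the helper name is ours, not FBGEMM's:

```python
import torch

def supports_fp8e4nv() -> bool:
    """True if Triton can compile fp8e4nv (float8_e4m3fn) kernels here.

    NVIDIA support for this encoding starts at compute capability (8, 9);
    pre-Ada GPUs get only fp8e4b15/fp8e5, matching the ValueError above.
    """
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```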
2025-05-07T20:31:41.0440240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.0440921Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.0441615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.0442306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.0442975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.0443503Z kernel = self.compile( 2025-05-07T20:31:41.0444053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.0444722Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.0445123Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.0445466Z 2025-05-07T20:31:41.0445676Z self = 2025-05-07T20:31:41.0446762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.0448135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738ee0c0>} 2025-05-07T20:31:41.0449473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.0450489Z context = 2025-05-07T20:31:41.0450785Z 2025-05-07T20:31:41.0450957Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.0451487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.0451956Z module_map=module_map) 2025-05-07T20:31:41.0452327Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.0452686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.0452949Z E ^ 2025-05-07T20:31:41.0453411Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): identical source listing and traceback elided, ending in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.4689510Z 2025-05-07T20:31:41.4689928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.4690447Z 2025-05-07T20:31:41.4690554Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.4690972Z self=, 2025-05-07T20:31:41.4691374Z T=1, 2025-05-07T20:31:41.4691574Z D=7168, 2025-05-07T20:31:41.4691775Z scale_ub=1200.0, 2025-05-07T20:31:41.4692001Z contiguous=True, 2025-05-07T20:31:41.4692231Z compiled=True, 2025-05-07T20:31:41.4692448Z ) 2025-05-07T20:31:41.4692774Z self = 2025-05-07T20:31:41.4693261Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:41.4693521Z 2025-05-07T20:31:41.4693608Z @given( 2025-05-07T20:31:41.4693840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.4694158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.4694469Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.4694805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.4695130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.4695418Z ) 2025-05-07T20:31:41.4695769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.4696251Z def test_silu_mul_quant( 2025-05-07T20:31:41.4696601Z self, 2025-05-07T20:31:41.4696833Z T: int, 2025-05-07T20:31:41.4697031Z D: int, 2025-05-07T20:31:41.4697259Z scale_ub: Optional[float], 2025-05-07T20:31:41.4697542Z contiguous: bool, 2025-05-07T20:31:41.4697778Z compiled: bool, 2025-05-07T20:31:41.4698004Z ) -> None: 2025-05-07T20:31:41.4698227Z torch.manual_seed(2025) 2025-05-07T20:31:41.4698467Z 2025-05-07T20:31:41.4698747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.4699092Z 2025-05-07T20:31:41.4699294Z x_sign = torch.sign(x) 2025-05-07T20:31:41.4699584Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.4699898Z x = x_sign * x_clamp 2025-05-07T20:31:41.4700142Z x0 = x[:, :D] 2025-05-07T20:31:41.4700361Z x1 = x[:, D:] 2025-05-07T20:31:41.4700574Z 2025-05-07T20:31:41.4700768Z if contiguous: 2025-05-07T20:31:41.4700997Z x0 = x0.contiguous() 2025-05-07T20:31:41.4701265Z x1 = x1.contiguous() 2025-05-07T20:31:41.4701507Z 2025-05-07T20:31:41.4701697Z if scale_ub is not None: 2025-05-07T20:31:41.4702064Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.4702403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.4702707Z ) 2025-05-07T20:31:41.4702908Z else: 2025-05-07T20:31:41.4703126Z scale_ub_tensor = None 2025-05-07T20:31:41.4703373Z 2025-05-07T20:31:41.4703618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.4703938Z op = silu_mul_quant 2025-05-07T20:31:41.4704199Z if compiled: 2025-05-07T20:31:41.4704446Z op = torch.compile(op) 2025-05-07T20:31:41.4704751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.4705030Z 2025-05-07T20:31:41.4705228Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.4705509Z 2025-05-07T20:31:41.4705609Z moe/activation_test.py:117: 2025-05-07T20:31:41.4705910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.4706239Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.4706532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.4707095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.4707657Z return fn(*args, **kwargs) 
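Because the failure is a compile-time property of the GPU rather than of any particular (T, D, scale_ub, contiguous, compiled) combination, the remaining tracebacks add no information; a capability-based skip would collapse the whole run into one skipped test. A sketch of such a guard for a module like moe/activation_test.py; the class name is illustrative, since the test class repr is stripped from this log:

```python
import unittest
import torch

_HAS_FP8E4NV = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@unittest.skipIf(not _HAS_FP8E4NV, "Triton fp8e4nv needs SM 8.9+ (Ada/Hopper)")
class SiluMulQuantTests(unittest.TestCase):  # illustrative name
    # the @given/@settings-decorated test_silu_mul_quant body from the
    # listing above would live here unchanged
    ...
```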
2025-05-07T20:31:41.4708310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:41.4709003Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:41.4709629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:41.4710308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:41.4710980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:41.4711518Z kernel = self.compile( 2025-05-07T20:31:41.4712068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:41.4712718Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:41.4713129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.4713356Z 2025-05-07T20:31:41.4713570Z self = 2025-05-07T20:31:41.4714651Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:41.4716006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8e840>} 2025-05-07T20:31:41.4717358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:41.4718379Z context = 2025-05-07T20:31:41.4718666Z 2025-05-07T20:31:41.4718849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:41.4719365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:41.4719836Z module_map=module_map) 2025-05-07T20:31:41.4720207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:41.4720572Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:41.4720833Z E ^ 2025-05-07T20:31:41.4721315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:41.4721761Z 2025-05-07T20:31:41.4722271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:41.4722785Z 2025-05-07T20:31:41.4722891Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:41.4723309Z self=, 2025-05-07T20:31:41.4723733Z T=1, 2025-05-07T20:31:41.4723951Z D=7168, 2025-05-07T20:31:41.4724145Z scale_ub=1200.0, 2025-05-07T20:31:41.4724376Z contiguous=False, 2025-05-07T20:31:41.4724607Z compiled=True, 2025-05-07T20:31:41.4724812Z ) 2025-05-07T20:31:41.5726075Z self = 2025-05-07T20:31:41.5726774Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:41.5727043Z 2025-05-07T20:31:41.5727540Z @given( 2025-05-07T20:31:41.5727782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:41.5728101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:41.5728779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:41.5729129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:41.5729463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:41.5729749Z ) 2025-05-07T20:31:41.5730096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:41.5730541Z def test_silu_mul_quant( 2025-05-07T20:31:41.5730796Z self, 2025-05-07T20:31:41.5730990Z T: int, 2025-05-07T20:31:41.5731194Z D: int, 2025-05-07T20:31:41.5731421Z scale_ub: Optional[float], 2025-05-07T20:31:41.5731688Z contiguous: bool, 2025-05-07T20:31:41.5731936Z compiled: bool, 2025-05-07T20:31:41.5732170Z ) -> None: 2025-05-07T20:31:41.5732391Z torch.manual_seed(2025) 2025-05-07T20:31:41.5732644Z 2025-05-07T20:31:41.5732941Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:41.5733290Z 2025-05-07T20:31:41.5733481Z x_sign = torch.sign(x) 2025-05-07T20:31:41.5733775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:41.5734089Z x = x_sign * x_clamp 2025-05-07T20:31:41.5734348Z x0 = x[:, :D] 2025-05-07T20:31:41.5734592Z x1 = x[:, D:] 2025-05-07T20:31:41.5734804Z 2025-05-07T20:31:41.5734995Z if contiguous: 2025-05-07T20:31:41.5735223Z x0 = x0.contiguous() 2025-05-07T20:31:41.5735485Z x1 = x1.contiguous() 2025-05-07T20:31:41.5735724Z 2025-05-07T20:31:41.5735915Z if scale_ub is not None: 2025-05-07T20:31:41.5736189Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:41.5736529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:41.5736833Z ) 2025-05-07T20:31:41.5737031Z else: 2025-05-07T20:31:41.5737247Z scale_ub_tensor = None 2025-05-07T20:31:41.5737496Z 2025-05-07T20:31:41.5737737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:41.5738058Z op = silu_mul_quant 2025-05-07T20:31:41.5738314Z if compiled: 2025-05-07T20:31:41.5738566Z op = torch.compile(op) 2025-05-07T20:31:41.5738866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.5739135Z 2025-05-07T20:31:41.5739331Z > y_fp8, y_scale = fn() 2025-05-07T20:31:41.5739503Z 2025-05-07T20:31:41.5739606Z moe/activation_test.py:117: 2025-05-07T20:31:41.5739901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:41.5740228Z moe/activation_test.py:115: in fn 2025-05-07T20:31:41.5740511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:41.5741069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:41.5741629Z return fn(*args, **kwargs) 
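Note that the eager path (moe/activation_test.py:115 in fn) and the compiled path (the eval_frame.py frame above) both funnel into the same Triton JIT compile, so neither Hypothesis nor torch.compile is needed to reproduce this. A repro sketch, assuming fbgemm_gpu is importable and the GPU predates SM 8.9; passing None for scale_ub mirrors the listing above:

```python
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
# On a GPU without fp8e4nv support this raises the same
# triton.compiler.errors.CompilationError at kernel-compile time;
# on SM 8.9+ it returns a (fp8 tensor, per-row scale) pair.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)
```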
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[compile frames identical to the examples above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True): here fn() returns and the reference path fails instead, in triton_quantize_fp8_row -> _kernel_quantize_fp8_row (fp8_gemm.py:2370), with the same CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical source listing and traceback elided, ending in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:41.7720475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then retries the same test body with fresh parameters; every one of the following examples fails during Triton compilation before the kernel ever runs:

2025-05-07T20:31:41.7721099Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.0394733Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.1323560Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:42.1356545Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.1387916Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:42.2794891Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:42.3981646Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.4023587Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:42.4929697Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:42.4962476Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:42.4997748Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples produces the identical traceback: silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, reached via torch/_dynamo/eval_frame.py:678 when compiled=True) launches _fbgemm_silu_mul_quant[grid], Triton re-runs JIT compilation (triton/runtime/jit.py:330 -> jit.py:623 -> triton/compiler/compiler.py:273 -> make_ir), and ast_to_ttir aborts with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
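The failure is environmental rather than numerical: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada) / 9.0 (Hopper) onward. This job runs on a linux.g5.4xlarge.nvidia.gpu runner, i.e. an A10G at SM 8.6, so Triton's CUDA backend offers only fp8e4b15 and fp8e5 there and rejects the kernel at compile time, independent of T, D, scale_ub, contiguous, or compiled. A minimal sketch of the kind of capability guard such a test could use follows; the helper name and the skipUnless usage are illustrative assumptions, not FBGEMM's actual API:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs native hardware support, which NVIDIA
        # exposes starting at SM 8.9; the A10G behind linux.g5.4xlarge is
        # SM 8.6, which is why Triton's make_ir raises here.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    # Hypothetical usage: skip on pre-SM89 runners instead of letting
    # every Hypothesis example die inside Triton compilation.
    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantTests(unittest.TestCase):
        ...

FBGEMM's CI may gate this differently (for example with its own device-capability checks); the sketch only illustrates why the error is architecture-dependent rather than a bug in the kernel's inputs.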
2025-05-07T20:31:42.8868707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.8869520Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.8870255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.8870940Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.8871599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.8872130Z kernel = self.compile( 2025-05-07T20:31:42.8872669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.8880799Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.8881234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8881475Z 2025-05-07T20:31:42.8881683Z self = 2025-05-07T20:31:42.8882913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.8884319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fd1c0>} 2025-05-07T20:31:42.8885671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.8886699Z context = 2025-05-07T20:31:42.8886993Z 2025-05-07T20:31:42.8887161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.8887697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.8888158Z module_map=module_map) 2025-05-07T20:31:42.8888534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.8888901Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.8889162Z E ^ 2025-05-07T20:31:42.8889635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.8890094Z 2025-05-07T20:31:42.8890517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.8891031Z 2025-05-07T20:31:42.8891143Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.8891555Z self=, 2025-05-07T20:31:42.8891965Z T=1, 2025-05-07T20:31:42.8892159Z D=7168, 2025-05-07T20:31:42.8892359Z scale_ub=None, 2025-05-07T20:31:42.8892588Z contiguous=False, 2025-05-07T20:31:42.8892819Z compiled=False, 2025-05-07T20:31:42.8893025Z ) 2025-05-07T20:31:42.8893350Z self = 2025-05-07T20:31:42.8893851Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:42.8894114Z 2025-05-07T20:31:42.8894202Z @given( 2025-05-07T20:31:42.8894432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.8894753Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.8895065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.8895395Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.8895731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.8896023Z ) 2025-05-07T20:31:42.8896369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.8896815Z def test_silu_mul_quant( 2025-05-07T20:31:42.8897069Z self, 2025-05-07T20:31:42.8897270Z T: int, 2025-05-07T20:31:42.8897466Z D: int, 2025-05-07T20:31:42.8897691Z scale_ub: Optional[float], 2025-05-07T20:31:42.8897969Z contiguous: bool, 2025-05-07T20:31:42.8898296Z compiled: bool, 2025-05-07T20:31:42.8898523Z ) -> None: 2025-05-07T20:31:42.8898745Z torch.manual_seed(2025) 2025-05-07T20:31:42.8898988Z 2025-05-07T20:31:42.8899265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.8899611Z 2025-05-07T20:31:42.8899803Z x_sign = torch.sign(x) 2025-05-07T20:31:42.8900098Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.8900410Z x = x_sign * x_clamp 2025-05-07T20:31:42.8900645Z x0 = x[:, :D] 2025-05-07T20:31:42.8900871Z x1 = x[:, D:] 2025-05-07T20:31:42.8901085Z 2025-05-07T20:31:42.8901269Z if contiguous: 2025-05-07T20:31:42.8901506Z x0 = x0.contiguous() 2025-05-07T20:31:42.8901852Z x1 = x1.contiguous() 2025-05-07T20:31:42.8902089Z 2025-05-07T20:31:42.8902289Z if scale_ub is not None: 2025-05-07T20:31:42.8902571Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.8902923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.8903228Z ) 2025-05-07T20:31:42.8903427Z else: 2025-05-07T20:31:42.8903644Z scale_ub_tensor = None 2025-05-07T20:31:42.8903893Z 2025-05-07T20:31:42.8904138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.8904465Z op = silu_mul_quant 2025-05-07T20:31:42.8904719Z if compiled: 2025-05-07T20:31:42.8904977Z op = torch.compile(op) 2025-05-07T20:31:42.8905290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.8905561Z 2025-05-07T20:31:42.8905765Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.8905929Z 2025-05-07T20:31:42.8906040Z moe/activation_test.py:117: 2025-05-07T20:31:42.8906349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8906690Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.8906980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.8907682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.8908369Z 
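Every example Hypothesis tries above dies in the same place: Triton's frontend rejects the fp8e4nv (FP8 E4M3) element type before the _fbgemm_silu_mul_quant kernel body is even lowered, so the test parameters never matter. A minimal sketch for confirming the mismatch on the runner, assuming the job ran on a GPU below compute capability 8.9 (e.g. an A10G at 8.6); the 8.9 threshold is our reading of Triton's supported-dtype list, not something the log states explicitly:

    # Hypothetical capability probe; fp8e4nv (torch.float8_e4m3fn) is assumed
    # to require compute capability >= (8, 9), per Triton's error message above.
    import torch

    cap = torch.cuda.get_device_capability()
    print(f"compute capability: {cap[0]}.{cap[1]}")
    if cap < (8, 9):
        # On this hardware Triton only offers fp8e4b15 and fp8e5, so any kernel
        # touching fp8e4nv raises the CompilationError seen throughout this log.
        print("fp8e4nv unsupported -> expect CompilationError from Triton")

Every further example Hypothesis draws hits the same frontend check, with an identical traceback: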
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.8908920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.8909672Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.8910349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.8910879Z kernel = self.compile( 2025-05-07T20:31:42.8911429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.8912102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.8912507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.8912751Z 2025-05-07T20:31:42.8912958Z self = 2025-05-07T20:31:42.8914046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.8915425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fdf80>} 2025-05-07T20:31:42.8916780Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.8917819Z context = 2025-05-07T20:31:42.8918113Z 2025-05-07T20:31:42.8918420Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.8918946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.8919416Z module_map=module_map) 2025-05-07T20:31:42.8919776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.8920132Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.8920394Z E ^ 2025-05-07T20:31:42.8920853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.8921314Z 2025-05-07T20:31:42.8921732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.8922333Z 2025-05-07T20:31:42.8922443Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.8922862Z self=, 2025-05-07T20:31:42.8923262Z T=2048, 2025-05-07T20:31:42.8923466Z D=7168, 2025-05-07T20:31:42.8923663Z scale_ub=None, 2025-05-07T20:31:42.8923878Z contiguous=False, 2025-05-07T20:31:42.8924107Z compiled=True, 2025-05-07T20:31:42.8924315Z ) 2025-05-07T20:31:42.9597929Z self = 2025-05-07T20:31:42.9598731Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.9599107Z 2025-05-07T20:31:42.9599219Z @given( 2025-05-07T20:31:42.9599540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.9599862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.9600168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.9600542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.9600878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.9601180Z ) 2025-05-07T20:31:42.9601536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.9601983Z def test_silu_mul_quant( 2025-05-07T20:31:42.9602235Z self, 2025-05-07T20:31:42.9602430Z T: int, 2025-05-07T20:31:42.9602636Z D: int, 2025-05-07T20:31:42.9602860Z scale_ub: Optional[float], 2025-05-07T20:31:42.9603135Z contiguous: bool, 2025-05-07T20:31:42.9603386Z compiled: bool, 2025-05-07T20:31:42.9603611Z ) -> None: 2025-05-07T20:31:42.9603831Z torch.manual_seed(2025) 2025-05-07T20:31:42.9604078Z 2025-05-07T20:31:42.9604389Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.9604747Z 2025-05-07T20:31:42.9604945Z x_sign = torch.sign(x) 2025-05-07T20:31:42.9605241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.9605552Z x = x_sign * x_clamp 2025-05-07T20:31:42.9605798Z x0 = x[:, :D] 2025-05-07T20:31:42.9606026Z x1 = x[:, D:] 2025-05-07T20:31:42.9606230Z 2025-05-07T20:31:42.9606428Z if contiguous: 2025-05-07T20:31:42.9606667Z x0 = x0.contiguous() 2025-05-07T20:31:42.9606920Z x1 = x1.contiguous() 2025-05-07T20:31:42.9607171Z 2025-05-07T20:31:42.9607369Z if scale_ub is not None: 2025-05-07T20:31:42.9607636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.9607985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.9608300Z ) 2025-05-07T20:31:42.9608493Z else: 2025-05-07T20:31:42.9608708Z scale_ub_tensor = None 2025-05-07T20:31:42.9608964Z 2025-05-07T20:31:42.9609196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.9609513Z op = silu_mul_quant 2025-05-07T20:31:42.9609775Z if compiled: 2025-05-07T20:31:42.9610023Z op = torch.compile(op) 2025-05-07T20:31:42.9610321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9610607Z 2025-05-07T20:31:42.9610804Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.9611320Z 2025-05-07T20:31:42.9611425Z moe/activation_test.py:117: 2025-05-07T20:31:42.9611724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9612062Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.9612342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9612896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.9613462Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.9614122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.9614799Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.9615478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.9616158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.9616816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.9617346Z kernel = self.compile( 2025-05-07T20:31:42.9617885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.9618541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.9618935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9619173Z 2025-05-07T20:31:42.9619379Z self = 2025-05-07T20:31:42.9620456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.9621839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723ff420>} 2025-05-07T20:31:42.9623171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.9624190Z context = 2025-05-07T20:31:42.9624479Z 2025-05-07T20:31:42.9624644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.9625211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.9625685Z module_map=module_map) 2025-05-07T20:31:42.9626053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.9626411Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.9626675Z E ^ 2025-05-07T20:31:42.9627141Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.9627593Z 2025-05-07T20:31:42.9628009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.9628923Z 2025-05-07T20:31:42.9629039Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:42.9629501Z self=, 2025-05-07T20:31:42.9629906Z T=4096, 2025-05-07T20:31:42.9630100Z D=7168, 2025-05-07T20:31:42.9630301Z scale_ub=None, 2025-05-07T20:31:42.9630517Z contiguous=False, 2025-05-07T20:31:42.9630749Z compiled=True, 2025-05-07T20:31:42.9630969Z ) 2025-05-07T20:31:42.9631285Z self = 2025-05-07T20:31:42.9631781Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:42.9632051Z 2025-05-07T20:31:42.9632314Z @given( 2025-05-07T20:31:42.9632563Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:42.9632912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:42.9633253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:42.9633620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:42.9633993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:42.9634312Z ) 2025-05-07T20:31:42.9634758Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:42.9635267Z def test_silu_mul_quant( 2025-05-07T20:31:42.9635531Z self, 2025-05-07T20:31:42.9635737Z T: int, 2025-05-07T20:31:42.9635939Z D: int, 2025-05-07T20:31:42.9636287Z scale_ub: Optional[float], 2025-05-07T20:31:42.9636588Z contiguous: bool, 2025-05-07T20:31:42.9636841Z compiled: bool, 2025-05-07T20:31:42.9637082Z ) -> None: 2025-05-07T20:31:42.9637317Z torch.manual_seed(2025) 2025-05-07T20:31:42.9637583Z 2025-05-07T20:31:42.9637884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:42.9638278Z 2025-05-07T20:31:42.9638475Z x_sign = torch.sign(x) 2025-05-07T20:31:42.9638795Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:42.9639142Z x = x_sign * x_clamp 2025-05-07T20:31:42.9639397Z x0 = x[:, :D] 2025-05-07T20:31:42.9639630Z x1 = x[:, D:] 2025-05-07T20:31:42.9639861Z 2025-05-07T20:31:42.9640058Z if contiguous: 2025-05-07T20:31:42.9640306Z x0 = x0.contiguous() 2025-05-07T20:31:42.9640594Z x1 = x1.contiguous() 2025-05-07T20:31:42.9640861Z 2025-05-07T20:31:42.9641069Z if scale_ub is not None: 2025-05-07T20:31:42.9641375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:42.9641758Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:42.9642109Z ) 2025-05-07T20:31:42.9642325Z else: 2025-05-07T20:31:42.9642557Z scale_ub_tensor = None 2025-05-07T20:31:42.9642826Z 2025-05-07T20:31:42.9643086Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:42.9643445Z op = silu_mul_quant 2025-05-07T20:31:42.9643718Z if compiled: 2025-05-07T20:31:42.9643989Z op = torch.compile(op) 2025-05-07T20:31:42.9644321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9644625Z 2025-05-07T20:31:42.9644839Z > y_fp8, y_scale = fn() 2025-05-07T20:31:42.9645032Z 2025-05-07T20:31:42.9645140Z moe/activation_test.py:117: 2025-05-07T20:31:42.9645480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9645861Z moe/activation_test.py:115: in fn 2025-05-07T20:31:42.9646179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:42.9646843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:42.9647509Z return fn(*args, **kwargs) 
2025-05-07T20:31:42.9648301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:42.9649130Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:42.9649762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:42.9650577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:42.9651376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:42.9652004Z kernel = self.compile( 2025-05-07T20:31:42.9652647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:42.9653436Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:42.9653983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:42.9654254Z 2025-05-07T20:31:42.9654499Z self = 2025-05-07T20:31:42.9655868Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:42.9657583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e8680>} 2025-05-07T20:31:42.9659249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:42.9660356Z context = 2025-05-07T20:31:42.9660648Z 2025-05-07T20:31:42.9660822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:42.9661352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:42.9661820Z module_map=module_map) 2025-05-07T20:31:42.9662192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:42.9662544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:42.9662801Z E ^ 2025-05-07T20:31:42.9663271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:42.9663716Z 2025-05-07T20:31:42.9664140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:42.9664653Z 2025-05-07T20:31:43.0929935Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0930643Z self=, 2025-05-07T20:31:43.0931243Z T=16384, 2025-05-07T20:31:43.0931516Z D=5120, 2025-05-07T20:31:43.0931788Z scale_ub=1200.0, 2025-05-07T20:31:43.0932053Z contiguous=False, 2025-05-07T20:31:43.0932285Z compiled=False, 2025-05-07T20:31:43.0932503Z ) 2025-05-07T20:31:43.0932827Z self = 2025-05-07T20:31:43.0933333Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:43.0933618Z 2025-05-07T20:31:43.0933707Z @given( 2025-05-07T20:31:43.0933939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0934259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0934579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0934918Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0935292Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0935582Z ) 2025-05-07T20:31:43.0935941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0936378Z def test_silu_mul_quant( 2025-05-07T20:31:43.0936630Z self, 2025-05-07T20:31:43.0936833Z T: int, 2025-05-07T20:31:43.0937033Z D: int, 2025-05-07T20:31:43.0937262Z scale_ub: Optional[float], 2025-05-07T20:31:43.0937539Z contiguous: bool, 2025-05-07T20:31:43.0937778Z compiled: bool, 2025-05-07T20:31:43.0938008Z ) -> None: 2025-05-07T20:31:43.0938232Z torch.manual_seed(2025) 2025-05-07T20:31:43.0938472Z 2025-05-07T20:31:43.0938752Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0939100Z 2025-05-07T20:31:43.0939297Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0939589Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0939898Z x = x_sign * x_clamp 2025-05-07T20:31:43.0940139Z x0 = x[:, :D] 2025-05-07T20:31:43.0940352Z x1 = x[:, D:] 2025-05-07T20:31:43.0940909Z 2025-05-07T20:31:43.0941102Z if contiguous: 2025-05-07T20:31:43.0941332Z x0 = x0.contiguous() 2025-05-07T20:31:43.0941594Z x1 = x1.contiguous() 2025-05-07T20:31:43.0941834Z 2025-05-07T20:31:43.0942024Z if scale_ub is not None: 2025-05-07T20:31:43.0942293Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0942639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0942950Z ) 2025-05-07T20:31:43.0943144Z else: 2025-05-07T20:31:43.0943360Z scale_ub_tensor = None 2025-05-07T20:31:43.0943615Z 2025-05-07T20:31:43.0943842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0944317Z op = silu_mul_quant 2025-05-07T20:31:43.0944573Z if compiled: 2025-05-07T20:31:43.0944831Z op = torch.compile(op) 2025-05-07T20:31:43.0945156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0945470Z 2025-05-07T20:31:43.0945668Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0945833Z 2025-05-07T20:31:43.0945933Z moe/activation_test.py:117: 2025-05-07T20:31:43.0946232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0946573Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0946850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0947540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:43.0948230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0948768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0949530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0950195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0950733Z kernel = self.compile( 2025-05-07T20:31:43.0951267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0951920Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0952326Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0952553Z 2025-05-07T20:31:43.0952764Z self = 2025-05-07T20:31:43.0953833Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0955266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e94e0>} 2025-05-07T20:31:43.0956607Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0957631Z context = 2025-05-07T20:31:43.0957916Z 2025-05-07T20:31:43.0958087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0958607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.0959072Z module_map=module_map) 2025-05-07T20:31:43.0959440Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.0959792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.0960056Z E ^ 2025-05-07T20:31:43.0960522Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.0961055Z 2025-05-07T20:31:43.0961480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.0961990Z 2025-05-07T20:31:43.0962096Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.0962537Z self=, 2025-05-07T20:31:43.0962942Z T=16384, 2025-05-07T20:31:43.0963138Z D=5120, 2025-05-07T20:31:43.0963331Z scale_ub=1200.0, 2025-05-07T20:31:43.0963558Z contiguous=True, 2025-05-07T20:31:43.0963783Z compiled=True, 2025-05-07T20:31:43.0964007Z ) 2025-05-07T20:31:43.0964407Z self = 2025-05-07T20:31:43.0965154Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.0965495Z 2025-05-07T20:31:43.0965605Z @given( 2025-05-07T20:31:43.0965891Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.0966293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.0966673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.0967000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.0967334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.0967628Z ) 2025-05-07T20:31:43.0967973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.0968415Z def test_silu_mul_quant( 2025-05-07T20:31:43.0968661Z self, 2025-05-07T20:31:43.0968856Z T: int, 2025-05-07T20:31:43.0969057Z D: int, 2025-05-07T20:31:43.0969277Z scale_ub: Optional[float], 2025-05-07T20:31:43.0969547Z contiguous: bool, 2025-05-07T20:31:43.0969787Z compiled: bool, 2025-05-07T20:31:43.0970010Z ) -> None: 2025-05-07T20:31:43.0970233Z torch.manual_seed(2025) 2025-05-07T20:31:43.0970471Z 2025-05-07T20:31:43.0970748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.0971101Z 2025-05-07T20:31:43.0971306Z x_sign = torch.sign(x) 2025-05-07T20:31:43.0971596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.0971906Z x = x_sign * x_clamp 2025-05-07T20:31:43.0972150Z x0 = x[:, :D] 2025-05-07T20:31:43.0972363Z x1 = x[:, D:] 2025-05-07T20:31:43.0981038Z 2025-05-07T20:31:43.0981264Z if contiguous: 2025-05-07T20:31:43.0981516Z x0 = x0.contiguous() 2025-05-07T20:31:43.0981780Z x1 = x1.contiguous() 2025-05-07T20:31:43.0982019Z 2025-05-07T20:31:43.0982221Z if scale_ub is not None: 2025-05-07T20:31:43.0982501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.0982848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.0983156Z ) 2025-05-07T20:31:43.0983357Z else: 2025-05-07T20:31:43.0983570Z scale_ub_tensor = None 2025-05-07T20:31:43.0983828Z 2025-05-07T20:31:43.0984077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.0984387Z op = silu_mul_quant 2025-05-07T20:31:43.0984651Z if compiled: 2025-05-07T20:31:43.0984904Z op = torch.compile(op) 2025-05-07T20:31:43.0985206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0985475Z 2025-05-07T20:31:43.0985675Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.0985846Z 2025-05-07T20:31:43.0985957Z moe/activation_test.py:117: 2025-05-07T20:31:43.0986248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0986584Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.0986874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.0987430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.0987994Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.0988766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.0989560Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.0990091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.0990770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.0991432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.0991957Z kernel = self.compile( 2025-05-07T20:31:43.0992502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.0993242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.0993646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.0993870Z 2025-05-07T20:31:43.0994082Z self = 2025-05-07T20:31:43.0995158Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.0996525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727ea8e0>} 2025-05-07T20:31:43.0997861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.0998885Z context = 2025-05-07T20:31:43.0999172Z 2025-05-07T20:31:43.0999338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.0999861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.1000333Z module_map=module_map) 2025-05-07T20:31:43.1000691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.1001045Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.1001305Z E ^ 2025-05-07T20:31:43.1001768Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.1002214Z 2025-05-07T20:31:43.1002633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.1003161Z 2025-05-07T20:31:43.4289103Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.4289755Z self=, 2025-05-07T20:31:43.4290284Z T=16384, 2025-05-07T20:31:43.4290498Z D=5120, 2025-05-07T20:31:43.4290726Z scale_ub=None, 2025-05-07T20:31:43.4290956Z contiguous=False, 2025-05-07T20:31:43.4291192Z compiled=True, 2025-05-07T20:31:43.4291401Z ) 2025-05-07T20:31:43.4291734Z self = 2025-05-07T20:31:43.4292239Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.4292515Z 2025-05-07T20:31:43.4292599Z @given( 2025-05-07T20:31:43.4292845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.4293165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.4293480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.4293811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.4294165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.4294464Z ) 2025-05-07T20:31:43.4294814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.4295633Z def test_silu_mul_quant( 2025-05-07T20:31:43.4295884Z self, 2025-05-07T20:31:43.4296082Z T: int, 2025-05-07T20:31:43.4296287Z D: int, 2025-05-07T20:31:43.4296516Z scale_ub: Optional[float], 2025-05-07T20:31:43.4296786Z contiguous: bool, 2025-05-07T20:31:43.4297035Z compiled: bool, 2025-05-07T20:31:43.4297277Z ) -> None: 2025-05-07T20:31:43.4297494Z torch.manual_seed(2025) 2025-05-07T20:31:43.4297744Z 2025-05-07T20:31:43.4298028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.4298373Z 2025-05-07T20:31:43.4298577Z x_sign = torch.sign(x) 2025-05-07T20:31:43.4298870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.4299342Z x = x_sign * x_clamp 2025-05-07T20:31:43.4299584Z x0 = x[:, :D] 2025-05-07T20:31:43.4299805Z x1 = x[:, D:] 2025-05-07T20:31:43.4300019Z 2025-05-07T20:31:43.4300205Z if contiguous: 2025-05-07T20:31:43.4300448Z x0 = x0.contiguous() 2025-05-07T20:31:43.4300713Z x1 = x1.contiguous() 2025-05-07T20:31:43.4300949Z 2025-05-07T20:31:43.4301145Z if scale_ub is not None: 2025-05-07T20:31:43.4301422Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.4301755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.4302068Z ) 2025-05-07T20:31:43.4302273Z else: 2025-05-07T20:31:43.4302484Z scale_ub_tensor = None 2025-05-07T20:31:43.4302748Z 2025-05-07T20:31:43.4302993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.4303304Z op = silu_mul_quant 2025-05-07T20:31:43.4303563Z if compiled: 2025-05-07T20:31:43.4303824Z op = torch.compile(op) 2025-05-07T20:31:43.4304128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.4304403Z 2025-05-07T20:31:43.4304633Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.4304822Z 2025-05-07T20:31:43.4304937Z moe/activation_test.py:117: 2025-05-07T20:31:43.4305236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.4305577Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.4305864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.4306420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.4306993Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.4307653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.4308347Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.4308876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.4309666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.4310330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.4310863Z kernel = self.compile( 2025-05-07T20:31:43.4311398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.4312055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.4312465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.4312694Z 2025-05-07T20:31:43.4312912Z self = 2025-05-07T20:31:43.4313984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.4315513Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727eaf20>} 2025-05-07T20:31:43.4316855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.4317878Z context = 2025-05-07T20:31:43.4318163Z 2025-05-07T20:31:43.4318328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.4318846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.4319314Z module_map=module_map) 2025-05-07T20:31:43.4319812Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.4320215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.4320499Z E ^ 2025-05-07T20:31:43.4321048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.4321597Z 2025-05-07T20:31:43.4322113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.4322734Z 2025-05-07T20:31:43.4322848Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.4323321Z self=, 2025-05-07T20:31:43.4323787Z T=2048, 2025-05-07T20:31:43.4323986Z D=5120, 2025-05-07T20:31:43.4324191Z scale_ub=None, 2025-05-07T20:31:43.4324424Z contiguous=False, 2025-05-07T20:31:43.4324661Z compiled=True, 2025-05-07T20:31:43.4324886Z ) 2025-05-07T20:31:43.5042378Z self = 2025-05-07T20:31:43.5043169Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:43.5043540Z 2025-05-07T20:31:43.5043656Z @given( 2025-05-07T20:31:43.5043902Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5044220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5044526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5044856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5045175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5045475Z ) 2025-05-07T20:31:43.5045829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5046262Z def test_silu_mul_quant( 2025-05-07T20:31:43.5046512Z self, 2025-05-07T20:31:43.5046710Z T: int, 2025-05-07T20:31:43.5046904Z D: int, 2025-05-07T20:31:43.5047131Z scale_ub: Optional[float], 2025-05-07T20:31:43.5047415Z contiguous: bool, 2025-05-07T20:31:43.5047658Z compiled: bool, 2025-05-07T20:31:43.5047890Z ) -> None: 2025-05-07T20:31:43.5048118Z torch.manual_seed(2025) 2025-05-07T20:31:43.5048358Z 2025-05-07T20:31:43.5048640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5048991Z 2025-05-07T20:31:43.5049197Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5049496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5049815Z x = x_sign * x_clamp 2025-05-07T20:31:43.5050068Z x0 = x[:, :D] 2025-05-07T20:31:43.5050293Z x1 = x[:, D:] 2025-05-07T20:31:43.5050510Z 2025-05-07T20:31:43.5050710Z if contiguous: 2025-05-07T20:31:43.5050944Z x0 = x0.contiguous() 2025-05-07T20:31:43.5051217Z x1 = x1.contiguous() 2025-05-07T20:31:43.5051465Z 2025-05-07T20:31:43.5051661Z if scale_ub is not None: 2025-05-07T20:31:43.5051942Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5052283Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5052587Z ) 2025-05-07T20:31:43.5052788Z else: 2025-05-07T20:31:43.5053372Z scale_ub_tensor = None 2025-05-07T20:31:43.5053628Z 2025-05-07T20:31:43.5053864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5054181Z op = silu_mul_quant 2025-05-07T20:31:43.5054435Z if compiled: 2025-05-07T20:31:43.5054678Z op = torch.compile(op) 2025-05-07T20:31:43.5054975Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5055257Z 2025-05-07T20:31:43.5055447Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5055615Z 2025-05-07T20:31:43.5055716Z moe/activation_test.py:117: 2025-05-07T20:31:43.5056019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5056350Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5056821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5057382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5057935Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5058604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5059288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5059820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5060491Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5061157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5061696Z kernel = self.compile( 2025-05-07T20:31:43.5062237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5062890Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5063291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5063521Z 2025-05-07T20:31:43.5063733Z self = 2025-05-07T20:31:43.5064848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5066243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f58d60>} 2025-05-07T20:31:43.5067584Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5068612Z context = 2025-05-07T20:31:43.5068899Z 2025-05-07T20:31:43.5069080Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5069695Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5070172Z module_map=module_map) 2025-05-07T20:31:43.5070546Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5070903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5071160Z E ^ 2025-05-07T20:31:43.5071626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5072071Z 2025-05-07T20:31:43.5072494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5073004Z 2025-05-07T20:31:43.5073108Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.5073529Z self=, 2025-05-07T20:31:43.5074022Z T=2048, 2025-05-07T20:31:43.5074219Z D=5120, 2025-05-07T20:31:43.5074411Z scale_ub=1200.0, 2025-05-07T20:31:43.5074640Z contiguous=False, 2025-05-07T20:31:43.5074869Z compiled=True, 2025-05-07T20:31:43.5075074Z ) 2025-05-07T20:31:43.5075448Z self = 2025-05-07T20:31:43.5075948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.5076218Z 2025-05-07T20:31:43.5076298Z @given( 2025-05-07T20:31:43.5076532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.5076850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.5077152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.5077562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.5077893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.5078175Z ) 2025-05-07T20:31:43.5078524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.5078967Z def test_silu_mul_quant( 2025-05-07T20:31:43.5079212Z self, 2025-05-07T20:31:43.5079404Z T: int, 2025-05-07T20:31:43.5079608Z D: int, 2025-05-07T20:31:43.5079835Z scale_ub: Optional[float], 2025-05-07T20:31:43.5080102Z contiguous: bool, 2025-05-07T20:31:43.5080347Z compiled: bool, 2025-05-07T20:31:43.5080569Z ) -> None: 2025-05-07T20:31:43.5080786Z torch.manual_seed(2025) 2025-05-07T20:31:43.5081029Z 2025-05-07T20:31:43.5081305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.5081647Z 2025-05-07T20:31:43.5081846Z x_sign = torch.sign(x) 2025-05-07T20:31:43.5082146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.5082462Z x = x_sign * x_clamp 2025-05-07T20:31:43.5082701Z x0 = x[:, :D] 2025-05-07T20:31:43.5082923Z x1 = x[:, D:] 2025-05-07T20:31:43.5083139Z 2025-05-07T20:31:43.5083335Z if contiguous: 2025-05-07T20:31:43.5083579Z x0 = x0.contiguous() 2025-05-07T20:31:43.5083843Z x1 = x1.contiguous() 2025-05-07T20:31:43.5084083Z 2025-05-07T20:31:43.5084282Z if scale_ub is not None: 2025-05-07T20:31:43.5084559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.5084893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.5085211Z ) 2025-05-07T20:31:43.5085416Z else: 2025-05-07T20:31:43.5085627Z scale_ub_tensor = None 2025-05-07T20:31:43.5085888Z 2025-05-07T20:31:43.5086124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.5086438Z op = silu_mul_quant 2025-05-07T20:31:43.5086699Z if compiled: 2025-05-07T20:31:43.5086956Z op = torch.compile(op) 2025-05-07T20:31:43.5087256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5087536Z 2025-05-07T20:31:43.5087742Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.5087906Z 2025-05-07T20:31:43.5088018Z moe/activation_test.py:117: 2025-05-07T20:31:43.5088316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5088663Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.5088958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.5089510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.5090070Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.5090726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.5091415Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.5091944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.5092712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.5093380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.5093911Z kernel = self.compile( 2025-05-07T20:31:43.5094462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.5095167Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.5095571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.5095801Z 2025-05-07T20:31:43.5096010Z self = 2025-05-07T20:31:43.5097085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.5098533Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f59760>} 2025-05-07T20:31:43.5099866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.5100879Z context = 2025-05-07T20:31:43.5101163Z 2025-05-07T20:31:43.5101328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.5101846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.5102321Z module_map=module_map) 2025-05-07T20:31:43.5102680Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.5103033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.5103302Z E ^ 2025-05-07T20:31:43.5103766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.5104216Z 2025-05-07T20:31:43.5104631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.5105152Z 2025-05-07T20:31:43.6437102Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6437792Z self=, 2025-05-07T20:31:43.6438354Z T=4096, 2025-05-07T20:31:43.6438603Z D=5120, 2025-05-07T20:31:43.6438804Z scale_ub=1200.0, 2025-05-07T20:31:43.6439037Z contiguous=True, 2025-05-07T20:31:43.6439263Z compiled=True, 2025-05-07T20:31:43.6439501Z ) 2025-05-07T20:31:43.6439827Z self = 2025-05-07T20:31:43.6440325Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:43.6440592Z 2025-05-07T20:31:43.6440689Z @given( 2025-05-07T20:31:43.6440924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.6441239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.6441544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.6441873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.6442209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.6442487Z ) 2025-05-07T20:31:43.6442837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.6443279Z def test_silu_mul_quant( 2025-05-07T20:31:43.6443517Z self, 2025-05-07T20:31:43.6443725Z T: int, 2025-05-07T20:31:43.6443931Z D: int, 2025-05-07T20:31:43.6444154Z scale_ub: Optional[float], 2025-05-07T20:31:43.6444430Z contiguous: bool, 2025-05-07T20:31:43.6444675Z compiled: bool, 2025-05-07T20:31:43.6444904Z ) -> None: 2025-05-07T20:31:43.6445459Z torch.manual_seed(2025) 2025-05-07T20:31:43.6445706Z 2025-05-07T20:31:43.6445983Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.6446322Z 2025-05-07T20:31:43.6446521Z x_sign = torch.sign(x) 2025-05-07T20:31:43.6446812Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.6447117Z x = x_sign * x_clamp 2025-05-07T20:31:43.6447363Z x0 = x[:, :D] 2025-05-07T20:31:43.6447589Z x1 = x[:, D:] 2025-05-07T20:31:43.6447794Z 2025-05-07T20:31:43.6447987Z if contiguous: 2025-05-07T20:31:43.6448224Z x0 = x0.contiguous() 2025-05-07T20:31:43.6448479Z x1 = x1.contiguous() 2025-05-07T20:31:43.6448724Z 2025-05-07T20:31:43.6449075Z if scale_ub is not None: 2025-05-07T20:31:43.6449342Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.6449679Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.6449991Z ) 2025-05-07T20:31:43.6450198Z else: 2025-05-07T20:31:43.6450410Z scale_ub_tensor = None 2025-05-07T20:31:43.6450664Z 2025-05-07T20:31:43.6450901Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.6451211Z op = silu_mul_quant 2025-05-07T20:31:43.6451467Z if compiled: 2025-05-07T20:31:43.6451722Z op = torch.compile(op) 2025-05-07T20:31:43.6452018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6452296Z 2025-05-07T20:31:43.6452495Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.6452658Z 2025-05-07T20:31:43.6452758Z moe/activation_test.py:117: 2025-05-07T20:31:43.6453056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6453397Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.6453681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.6454236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.6454795Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.6455454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.6456155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.6456680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.6457358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.6458024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.6458556Z kernel = self.compile( 2025-05-07T20:31:43.6459098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.6459761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.6460165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.6460397Z 2025-05-07T20:31:43.6460608Z self = 2025-05-07T20:31:43.6470267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.6471672Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f5a980>} 2025-05-07T20:31:43.6473027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.6474063Z context = 2025-05-07T20:31:43.6474477Z 2025-05-07T20:31:43.6474676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.6475222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.6475690Z module_map=module_map) 2025-05-07T20:31:43.6476060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.6476413Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.6476678Z E ^ 2025-05-07T20:31:43.6477150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:43.6477601Z 2025-05-07T20:31:43.6478021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:43.6478741Z 2025-05-07T20:31:43.6478852Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:43.6479334Z self=, 2025-05-07T20:31:43.6479801Z T=128, 2025-05-07T20:31:43.6479998Z D=5120, 2025-05-07T20:31:43.6480208Z scale_ub=1200.0, 2025-05-07T20:31:43.6480456Z contiguous=False, 2025-05-07T20:31:43.6480694Z compiled=True, 2025-05-07T20:31:43.6480929Z ) 2025-05-07T20:31:43.7307938Z self = 2025-05-07T20:31:43.7308703Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:43.7309062Z 2025-05-07T20:31:43.7309237Z @given( 2025-05-07T20:31:43.7309473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:43.7309790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:43.7310125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:43.7310462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:43.7310790Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:43.7311079Z ) 2025-05-07T20:31:43.7311453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:43.7311893Z def test_silu_mul_quant( 2025-05-07T20:31:43.7312146Z self, 2025-05-07T20:31:43.7312353Z T: int, 2025-05-07T20:31:43.7312552Z D: int, 2025-05-07T20:31:43.7312779Z scale_ub: Optional[float], 2025-05-07T20:31:43.7313057Z contiguous: bool, 2025-05-07T20:31:43.7313295Z compiled: bool, 2025-05-07T20:31:43.7313529Z ) -> None: 2025-05-07T20:31:43.7313757Z torch.manual_seed(2025) 2025-05-07T20:31:43.7314000Z 2025-05-07T20:31:43.7314281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:43.7314628Z 2025-05-07T20:31:43.7314843Z x_sign = torch.sign(x) 2025-05-07T20:31:43.7315139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:43.7315459Z x = x_sign * x_clamp 2025-05-07T20:31:43.7315714Z x0 = x[:, :D] 2025-05-07T20:31:43.7315937Z x1 = x[:, D:] 2025-05-07T20:31:43.7316159Z 2025-05-07T20:31:43.7316358Z if contiguous: 2025-05-07T20:31:43.7316596Z x0 = x0.contiguous() 2025-05-07T20:31:43.7316864Z x1 = x1.contiguous() 2025-05-07T20:31:43.7317114Z 2025-05-07T20:31:43.7317310Z if scale_ub is not None: 2025-05-07T20:31:43.7317595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:43.7317941Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:43.7318248Z ) 2025-05-07T20:31:43.7318459Z else: 2025-05-07T20:31:43.7318681Z scale_ub_tensor = None 2025-05-07T20:31:43.7318936Z 2025-05-07T20:31:43.7319179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:43.7319511Z op = silu_mul_quant 2025-05-07T20:31:43.7319775Z if compiled: 2025-05-07T20:31:43.7320027Z op = torch.compile(op) 2025-05-07T20:31:43.7320337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7320974Z 2025-05-07T20:31:43.7321172Z > y_fp8, y_scale = fn() 2025-05-07T20:31:43.7321348Z 2025-05-07T20:31:43.7321455Z moe/activation_test.py:117: 2025-05-07T20:31:43.7321761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7322096Z moe/activation_test.py:115: in fn 2025-05-07T20:31:43.7322386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:43.7322953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:43.7323519Z return fn(*args, **kwargs) 
2025-05-07T20:31:43.7324174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:43.7325067Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:43.7325609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:43.7326294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:43.7326958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:43.7327495Z kernel = self.compile( 2025-05-07T20:31:43.7328041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:43.7329064Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:43.7329468Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:43.7329699Z 2025-05-07T20:31:43.7329915Z self = 2025-05-07T20:31:43.7330992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:43.7332374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871db0720>} 2025-05-07T20:31:43.7333710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:43.7334725Z context = 2025-05-07T20:31:43.7335058Z 2025-05-07T20:31:43.7335234Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:43.7335743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:43.7336213Z module_map=module_map) 2025-05-07T20:31:43.7336578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:43.7336933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:43.7337187Z E ^ 2025-05-07T20:31:43.7337655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[The next six Hypothesis examples fail with the identical CompilationError and traceback as above; the repeated test source and tracebacks are condensed to one line per example.]

2025-05-07T20:31:43.7339132Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError
2025-05-07T20:31:43.8335974Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError
2025-05-07T20:31:43.8368143Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:31:44.1798487Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:31:44.1840377Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError
2025-05-07T20:31:44.2611303Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError
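[Annotation] The repeated failure above is an architecture mismatch, not a problem with the sampled inputs: Triton's fp8e4nv type is float8 e4m3, which this Triton build only code-generates for GPUs of compute capability 8.9 or newer (Ada/Hopper). Ampere-class devices report (8, 0) or (8, 6) and only expose fp8e4b15 and fp8e5 (e5m2), so compilation aborts in make_ir before the kernel ever runs. A minimal guard along these lines could skip such examples up front; supports_fp8e4nv and the class name are hypothetical, not part of the test suite:

    # Hedged sketch: gate fp8 e4m3 tests on device capability.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's NVIDIA backend accepts fp8e4nv (float8 e4m3) only on
        # compute capability >= (8, 9); Ampere parts report (8, 0)/(8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...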
2025-05-07T20:31:44.3290454Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:44.3299617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:44.3300157Z x_sign = torch.sign(x)
2025-05-07T20:31:44.3300451Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:44.3302457Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
2025-05-07T20:31:44.3304430Z See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:44.3304560Z moe/activation_test.py:95: OutOfMemoryError

[Subsequent examples alternate between the same OutOfMemoryError and the same CompilationError; each is condensed to one line, keeping the failing statement and the attempted allocation size.]

2025-05-07T20:31:44.3304877Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)); tried to allocate 112.00 MiB
2025-05-07T20:31:44.3318181Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)); tried to allocate 448.00 MiB
2025-05-07T20:31:44.3331567Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)); tried to allocate 56.00 MiB
2025-05-07T20:31:44.3344944Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)); tried to allocate 56.00 MiB
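[Annotation] The out-of-memory failures interleaved with the compile errors are consistent with allocations accumulating across Hypothesis examples: each example builds a [T, 2*D] bfloat16 input (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes = 448 MiB, matching the failed allocation above) plus same-sized intermediates for sign/clamp, and by this point roughly 21.6 GiB of the 22.07 GiB device is already held by PyTorch. The error text itself suggests expandable segments; a sketch of that, plus releasing cached blocks between examples (a hypothetical helper, not existing code):

    # PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, e.g. in
    # the shell that launches the tests, as the error message suggests:
    #
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
    #
    import gc

    import torch

    def free_cuda_memory() -> None:
        # Hypothetical cleanup to run after each Hypothesis example.
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver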
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.4233739Z 2025-05-07T20:31:44.4233866Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.4234078Z 2025-05-07T20:31:44.4234181Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4234593Z self=, 2025-05-07T20:31:44.4235054Z T=1, 2025-05-07T20:31:44.4235241Z D=7168, 2025-05-07T20:31:44.4235443Z scale_ub=1200.0, 2025-05-07T20:31:44.4235672Z contiguous=True, 2025-05-07T20:31:44.4235890Z compiled=False, 2025-05-07T20:31:44.4236099Z ) 2025-05-07T20:31:44.4236614Z self = 2025-05-07T20:31:44.4237093Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.4237361Z 2025-05-07T20:31:44.4237443Z @given( 2025-05-07T20:31:44.4237673Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4237983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4238280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4238609Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4238936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4239210Z ) 2025-05-07T20:31:44.4239559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4240195Z def test_silu_mul_quant( 2025-05-07T20:31:44.4240434Z self, 2025-05-07T20:31:44.4240629Z T: int, 2025-05-07T20:31:44.4240829Z D: int, 2025-05-07T20:31:44.4241044Z scale_ub: Optional[float], 2025-05-07T20:31:44.4241317Z contiguous: bool, 2025-05-07T20:31:44.4241556Z compiled: bool, 2025-05-07T20:31:44.4241778Z ) -> None: 2025-05-07T20:31:44.4241986Z torch.manual_seed(2025) 2025-05-07T20:31:44.4242224Z 2025-05-07T20:31:44.4242490Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4242827Z 2025-05-07T20:31:44.4243020Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4243309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4243611Z x = x_sign * x_clamp 2025-05-07T20:31:44.4243853Z x0 = x[:, :D] 2025-05-07T20:31:44.4244074Z x1 = x[:, D:] 2025-05-07T20:31:44.4244275Z 2025-05-07T20:31:44.4244473Z if contiguous: 2025-05-07T20:31:44.4244709Z x0 = x0.contiguous() 2025-05-07T20:31:44.4244961Z x1 = x1.contiguous() 2025-05-07T20:31:44.4245199Z 2025-05-07T20:31:44.4245393Z if scale_ub is not None: 2025-05-07T20:31:44.4245710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4246058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4246391Z ) 2025-05-07T20:31:44.4246592Z else: 2025-05-07T20:31:44.4246801Z scale_ub_tensor = None 2025-05-07T20:31:44.4247049Z 2025-05-07T20:31:44.4247285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4247601Z op = silu_mul_quant 2025-05-07T20:31:44.4247852Z if compiled: 2025-05-07T20:31:44.4248100Z op = torch.compile(op) 2025-05-07T20:31:44.4248393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4248668Z 2025-05-07T20:31:44.4248857Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4249025Z 2025-05-07T20:31:44.4249125Z moe/activation_test.py:117: 2025-05-07T20:31:44.4249419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4249751Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4250033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4250718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4251406Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4251943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4252637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4253290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4253816Z kernel = self.compile( 2025-05-07T20:31:44.4254363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4255007Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4255491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4255727Z 2025-05-07T20:31:44.4255932Z self = 2025-05-07T20:31:44.4257006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4258353Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191dbc0>} 2025-05-07T20:31:44.4259684Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4260775Z context = 2025-05-07T20:31:44.4261062Z 2025-05-07T20:31:44.4261231Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4261744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4262202Z module_map=module_map) 2025-05-07T20:31:44.4262561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4262910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4263168Z E ^ 2025-05-07T20:31:44.4263634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4264080Z 2025-05-07T20:31:44.4264502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4265038Z 2025-05-07T20:31:44.4265174Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4265580Z self=, 2025-05-07T20:31:44.4265981Z T=128, 2025-05-07T20:31:44.4266172Z D=5120, 2025-05-07T20:31:44.4266360Z scale_ub=None, 2025-05-07T20:31:44.4266575Z contiguous=True, 2025-05-07T20:31:44.4266801Z compiled=False, 2025-05-07T20:31:44.4266998Z ) 2025-05-07T20:31:44.4805119Z self = 2025-05-07T20:31:44.4805920Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.4806306Z 2025-05-07T20:31:44.4806484Z @given( 2025-05-07T20:31:44.4806802Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4807234Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4807633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4807962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4808306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4808592Z ) 2025-05-07T20:31:44.4808944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4809388Z def test_silu_mul_quant( 2025-05-07T20:31:44.4809623Z self, 2025-05-07T20:31:44.4809818Z T: int, 2025-05-07T20:31:44.4810015Z D: int, 2025-05-07T20:31:44.4810228Z scale_ub: Optional[float], 2025-05-07T20:31:44.4810498Z contiguous: bool, 2025-05-07T20:31:44.4810736Z compiled: bool, 2025-05-07T20:31:44.4810951Z ) -> None: 2025-05-07T20:31:44.4811167Z torch.manual_seed(2025) 2025-05-07T20:31:44.4811406Z 2025-05-07T20:31:44.4811676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4812016Z 2025-05-07T20:31:44.4812214Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4812497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4812802Z x = x_sign * x_clamp 2025-05-07T20:31:44.4813042Z x0 = x[:, :D] 2025-05-07T20:31:44.4813256Z x1 = x[:, D:] 2025-05-07T20:31:44.4813623Z 2025-05-07T20:31:44.4813819Z if contiguous: 2025-05-07T20:31:44.4814044Z x0 = x0.contiguous() 2025-05-07T20:31:44.4814307Z x1 = x1.contiguous() 2025-05-07T20:31:44.4814545Z 2025-05-07T20:31:44.4814731Z if scale_ub is not None: 2025-05-07T20:31:44.4815001Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4815337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4815645Z ) 2025-05-07T20:31:44.4815836Z else: 2025-05-07T20:31:44.4816050Z scale_ub_tensor = None 2025-05-07T20:31:44.4816295Z 2025-05-07T20:31:44.4816518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4816971Z op = silu_mul_quant 2025-05-07T20:31:44.4817221Z if compiled: 2025-05-07T20:31:44.4817464Z op = torch.compile(op) 2025-05-07T20:31:44.4817758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4818037Z 2025-05-07T20:31:44.4818227Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4818396Z 2025-05-07T20:31:44.4818495Z moe/activation_test.py:117: 2025-05-07T20:31:44.4818786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4819116Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4819390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4820071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4820755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4821286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4821972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4822626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4823180Z kernel = self.compile( 2025-05-07T20:31:44.4823719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4824364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4824762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4825007Z 2025-05-07T20:31:44.4825256Z self = 2025-05-07T20:31:44.4826334Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4827693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191ed40>} 2025-05-07T20:31:44.4829249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4830265Z context = 2025-05-07T20:31:44.4830549Z 2025-05-07T20:31:44.4830722Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4831228Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4831691Z module_map=module_map) 2025-05-07T20:31:44.4832050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4832407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4832658Z E ^ 2025-05-07T20:31:44.4833118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4833683Z 2025-05-07T20:31:44.4834103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4834607Z 2025-05-07T20:31:44.4834709Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4835115Z self=, 2025-05-07T20:31:44.4835565Z T=128, 2025-05-07T20:31:44.4835758Z D=7168, 2025-05-07T20:31:44.4835952Z scale_ub=None, 2025-05-07T20:31:44.4836168Z contiguous=True, 2025-05-07T20:31:44.4836387Z compiled=False, 2025-05-07T20:31:44.4836587Z ) 2025-05-07T20:31:44.4836905Z self = 2025-05-07T20:31:44.4837503Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.4837766Z 2025-05-07T20:31:44.4837844Z @given( 2025-05-07T20:31:44.4838077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.4838392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.4838696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.4839021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.4839348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.4839627Z ) 2025-05-07T20:31:44.4839966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.4840404Z def test_silu_mul_quant( 2025-05-07T20:31:44.4840649Z self, 2025-05-07T20:31:44.4840843Z T: int, 2025-05-07T20:31:44.4841039Z D: int, 2025-05-07T20:31:44.4841259Z scale_ub: Optional[float], 2025-05-07T20:31:44.4841523Z contiguous: bool, 2025-05-07T20:31:44.4841763Z compiled: bool, 2025-05-07T20:31:44.4841981Z ) -> None: 2025-05-07T20:31:44.4842194Z torch.manual_seed(2025) 2025-05-07T20:31:44.4842435Z 2025-05-07T20:31:44.4842715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.4843061Z 2025-05-07T20:31:44.4843256Z x_sign = torch.sign(x) 2025-05-07T20:31:44.4843548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.4843854Z x = x_sign * x_clamp 2025-05-07T20:31:44.4844093Z x0 = x[:, :D] 2025-05-07T20:31:44.4844307Z x1 = x[:, D:] 2025-05-07T20:31:44.4844517Z 2025-05-07T20:31:44.4844697Z if contiguous: 2025-05-07T20:31:44.4844927Z x0 = x0.contiguous() 2025-05-07T20:31:44.4845194Z x1 = x1.contiguous() 2025-05-07T20:31:44.4845429Z 2025-05-07T20:31:44.4845623Z if scale_ub is not None: 2025-05-07T20:31:44.4845891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.4846224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.4846528Z ) 2025-05-07T20:31:44.4846719Z else: 2025-05-07T20:31:44.4846926Z scale_ub_tensor = None 2025-05-07T20:31:44.4847174Z 2025-05-07T20:31:44.4847406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.4847716Z op = silu_mul_quant 2025-05-07T20:31:44.4847965Z if compiled: 2025-05-07T20:31:44.4848215Z op = torch.compile(op) 2025-05-07T20:31:44.4848509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4848795Z 2025-05-07T20:31:44.4848991Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.4849158Z 2025-05-07T20:31:44.4849266Z moe/activation_test.py:117: 2025-05-07T20:31:44.4849553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4849887Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.4850166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.4850848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.4851528Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.4852152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.4852826Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.4853475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.4854001Z kernel = self.compile( 2025-05-07T20:31:44.4854536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.4855230Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.4855627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.4855938Z 2025-05-07T20:31:44.4856140Z self = 2025-05-07T20:31:44.4857214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.4858568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191fd80>} 2025-05-07T20:31:44.4859892Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.4860906Z context = 2025-05-07T20:31:44.4861194Z 2025-05-07T20:31:44.4861369Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.4861879Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.4862344Z module_map=module_map) 2025-05-07T20:31:44.4862709Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.4863055Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.4863304Z E ^ 2025-05-07T20:31:44.4863762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.4864210Z 2025-05-07T20:31:44.4864626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.4865133Z 2025-05-07T20:31:44.4865243Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.4865653Z self=, 2025-05-07T20:31:44.4866103Z T=2048, 2025-05-07T20:31:44.4866287Z D=7168, 2025-05-07T20:31:44.4866473Z scale_ub=1200.0, 2025-05-07T20:31:44.4866697Z contiguous=True, 2025-05-07T20:31:44.4866916Z compiled=False, 2025-05-07T20:31:44.4867117Z ) 2025-05-07T20:31:44.5533066Z self = 2025-05-07T20:31:44.5533756Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5534104Z 2025-05-07T20:31:44.5534222Z @given( 2025-05-07T20:31:44.5534493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5534908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5535219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5535546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5535862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5536143Z ) 2025-05-07T20:31:44.5536486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5536922Z def test_silu_mul_quant( 2025-05-07T20:31:44.5537161Z self, 2025-05-07T20:31:44.5537355Z T: int, 2025-05-07T20:31:44.5537552Z D: int, 2025-05-07T20:31:44.5537920Z scale_ub: Optional[float], 2025-05-07T20:31:44.5538191Z contiguous: bool, 2025-05-07T20:31:44.5538426Z compiled: bool, 2025-05-07T20:31:44.5538643Z ) -> None: 2025-05-07T20:31:44.5538855Z torch.manual_seed(2025) 2025-05-07T20:31:44.5539090Z 2025-05-07T20:31:44.5539359Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5541400Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
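Two distinct failures are interleaved above: a Triton CompilationError on fp8e4nv and CUDA out-of-memory errors. The CompilationError is an architecture limit rather than a kernel bug: Triton only lowers the fp8e4nv (e4m3) type on NVIDIA GPUs of compute capability 8.9 or newer, and the supported-dtype list it prints here, ('fp8e4b15', 'fp8e5'), is exactly what it offers below sm_89. A minimal skip guard along the following lines would keep the kernel from being compiled on such devices; this is a sketch, and the helper name is illustrative, not FBGEMM code:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv / e4m3 lowering requires compute capability 8.9+ (Ada, Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...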
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.5543362Z 2025-05-07T20:31:44.5543493Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.5543701Z 2025-05-07T20:31:44.5543808Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5544219Z self=, 2025-05-07T20:31:44.5544611Z T=1, 2025-05-07T20:31:44.5544796Z D=5120, 2025-05-07T20:31:44.5544987Z scale_ub=1200.0, 2025-05-07T20:31:44.5545233Z contiguous=True, 2025-05-07T20:31:44.5545478Z compiled=False, 2025-05-07T20:31:44.5545687Z ) 2025-05-07T20:31:44.5546004Z self = 2025-05-07T20:31:44.5546479Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.5546750Z 2025-05-07T20:31:44.5546829Z @given( 2025-05-07T20:31:44.5547057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5547366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5547666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5547998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5548331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5548608Z ) 2025-05-07T20:31:44.5548955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5549467Z def test_silu_mul_quant( 2025-05-07T20:31:44.5549702Z self, 2025-05-07T20:31:44.5549901Z T: int, 2025-05-07T20:31:44.5550096Z D: int, 2025-05-07T20:31:44.5550307Z scale_ub: Optional[float], 2025-05-07T20:31:44.5550575Z contiguous: bool, 2025-05-07T20:31:44.5550815Z compiled: bool, 2025-05-07T20:31:44.5551034Z ) -> None: 2025-05-07T20:31:44.5551244Z torch.manual_seed(2025) 2025-05-07T20:31:44.5551489Z 2025-05-07T20:31:44.5551760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5552095Z 2025-05-07T20:31:44.5552286Z x_sign = torch.sign(x) 2025-05-07T20:31:44.5552577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.5552881Z x = x_sign * x_clamp 2025-05-07T20:31:44.5553118Z x0 = x[:, :D] 2025-05-07T20:31:44.5553338Z x1 = x[:, D:] 2025-05-07T20:31:44.5553540Z 2025-05-07T20:31:44.5553722Z if contiguous: 2025-05-07T20:31:44.5553955Z x0 = x0.contiguous() 2025-05-07T20:31:44.5554205Z x1 = x1.contiguous() 2025-05-07T20:31:44.5554439Z 2025-05-07T20:31:44.5554631Z if scale_ub is not None: 2025-05-07T20:31:44.5554894Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.5555260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.5555581Z ) 2025-05-07T20:31:44.5555774Z else: 2025-05-07T20:31:44.5555982Z scale_ub_tensor = None 2025-05-07T20:31:44.5556225Z 2025-05-07T20:31:44.5556451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.5556755Z op = silu_mul_quant 2025-05-07T20:31:44.5557091Z if compiled: 2025-05-07T20:31:44.5557347Z op = torch.compile(op) 2025-05-07T20:31:44.5557637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5557911Z 2025-05-07T20:31:44.5558105Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.5558266Z 2025-05-07T20:31:44.5558361Z moe/activation_test.py:117: 2025-05-07T20:31:44.5558651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5558980Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.5559252Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.5559939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.5560724Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.5561251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.5561925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.5562578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.5563104Z kernel = self.compile( 2025-05-07T20:31:44.5563644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.5564288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.5564695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.5564919Z 2025-05-07T20:31:44.5565152Z self = 2025-05-07T20:31:44.5566227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.5567575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871b1d3a0>} 2025-05-07T20:31:44.5568901Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.5569912Z context = 2025-05-07T20:31:44.5570195Z 2025-05-07T20:31:44.5570364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.5570882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.5571347Z module_map=module_map) 2025-05-07T20:31:44.5571702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.5572059Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.5572319Z E ^ 2025-05-07T20:31:44.5572781Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.5573240Z 2025-05-07T20:31:44.5581364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.5581928Z 2025-05-07T20:31:44.5582039Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.5582454Z self=, 2025-05-07T20:31:44.5582846Z T=2048, 2025-05-07T20:31:44.5583035Z D=5120, 2025-05-07T20:31:44.5583222Z scale_ub=None, 2025-05-07T20:31:44.5583438Z contiguous=True, 2025-05-07T20:31:44.5583661Z compiled=False, 2025-05-07T20:31:44.5583859Z ) 2025-05-07T20:31:44.5584174Z self = 2025-05-07T20:31:44.5584768Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:44.5585045Z 2025-05-07T20:31:44.5585139Z @given( 2025-05-07T20:31:44.5585401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.5585713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.5586018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.5586345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.5586663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.5586947Z ) 2025-05-07T20:31:44.5587296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.5587730Z def test_silu_mul_quant( 2025-05-07T20:31:44.5587969Z self, 2025-05-07T20:31:44.5588243Z T: int, 2025-05-07T20:31:44.5588430Z D: int, 2025-05-07T20:31:44.5588643Z scale_ub: Optional[float], 2025-05-07T20:31:44.5588907Z contiguous: bool, 2025-05-07T20:31:44.5589210Z compiled: bool, 2025-05-07T20:31:44.5589435Z ) -> None: 2025-05-07T20:31:44.5589650Z torch.manual_seed(2025) 2025-05-07T20:31:44.5589885Z 2025-05-07T20:31:44.5590146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.5590485Z 2025-05-07T20:31:44.5590672Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.5592624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
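Most of the remaining examples die on their very first CUDA allocation, even small ones (40-448 MiB), because the process already holds roughly 22 GiB. The error text itself suggests the expandable_segments allocator mode; note that it only helps fragmentation and must be configured before the first CUDA allocation in the process, for example as sketched below (exporting the variable in the job environment works equally well):

    import os

    # Must run before CUDA is initialized, i.e. before the first allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var on purpose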
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

Nine further examples fail identically on the first CUDA allocation in the test body (moe/activation_test.py:92), with the same allocator state each time: GPU 0 total capacity 22.07 GiB, 30.44 MiB free, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated. Only the parameters and requested sizes vary:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False): tried to allocate 320.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False): tried to allocate 80.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False): tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): tried to allocate 40.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 448.00 MiB

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.7330969Z 2025-05-07T20:31:44.7331090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.7331306Z 2025-05-07T20:31:44.7331407Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.7331812Z self=, 2025-05-07T20:31:44.7332203Z T=128, 2025-05-07T20:31:44.7332392Z D=5120, 2025-05-07T20:31:44.7332579Z scale_ub=1200.0, 2025-05-07T20:31:44.7332814Z contiguous=False, 2025-05-07T20:31:44.7333036Z compiled=False, 2025-05-07T20:31:44.7333237Z ) 2025-05-07T20:31:44.8354552Z self = 2025-05-07T20:31:44.8355617Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:44.8356008Z 2025-05-07T20:31:44.8356120Z @given( 2025-05-07T20:31:44.8356416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8356734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8357041Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8357373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8357698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8357985Z ) 2025-05-07T20:31:44.8358344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8365933Z def test_silu_mul_quant( 2025-05-07T20:31:44.8366361Z self, 2025-05-07T20:31:44.8366554Z T: int, 2025-05-07T20:31:44.8366747Z D: int, 2025-05-07T20:31:44.8366967Z scale_ub: Optional[float], 2025-05-07T20:31:44.8367235Z contiguous: bool, 2025-05-07T20:31:44.8367472Z compiled: bool, 2025-05-07T20:31:44.8367703Z ) -> None: 2025-05-07T20:31:44.8367913Z torch.manual_seed(2025) 2025-05-07T20:31:44.8368155Z 2025-05-07T20:31:44.8368422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8368754Z 2025-05-07T20:31:44.8368944Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8369233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8369539Z x = x_sign * x_clamp 2025-05-07T20:31:44.8369779Z x0 = x[:, :D] 2025-05-07T20:31:44.8369991Z x1 = x[:, D:] 2025-05-07T20:31:44.8370191Z 2025-05-07T20:31:44.8370379Z if contiguous: 2025-05-07T20:31:44.8370609Z x0 = x0.contiguous() 2025-05-07T20:31:44.8370865Z x1 = x1.contiguous() 2025-05-07T20:31:44.8371101Z 2025-05-07T20:31:44.8371293Z if scale_ub is not None: 2025-05-07T20:31:44.8371554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.8371890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.8372194Z ) 2025-05-07T20:31:44.8372395Z else: 2025-05-07T20:31:44.8372598Z scale_ub_tensor = None 2025-05-07T20:31:44.8372843Z 2025-05-07T20:31:44.8373070Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.8373378Z op = silu_mul_quant 2025-05-07T20:31:44.8373627Z if compiled: 2025-05-07T20:31:44.8373875Z op = torch.compile(op) 2025-05-07T20:31:44.8374164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8374439Z 2025-05-07T20:31:44.8374635Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.8374799Z 2025-05-07T20:31:44.8374899Z moe/activation_test.py:117: 2025-05-07T20:31:44.8375243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8375574Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.8375850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8376536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.8377220Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.8377758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.8378443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.8379105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.8379632Z kernel = self.compile( 2025-05-07T20:31:44.8380173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.8380824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.8381220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8381444Z 2025-05-07T20:31:44.8381739Z self = 2025-05-07T20:31:44.8382834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.8384190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68718c40e0>} 2025-05-07T20:31:44.8385567Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.8386660Z context = 2025-05-07T20:31:44.8386948Z 2025-05-07T20:31:44.8387120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.8387638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.8388094Z module_map=module_map) 2025-05-07T20:31:44.8388455Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.8388803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.8389051Z E ^ 2025-05-07T20:31:44.8389567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.8390020Z 2025-05-07T20:31:44.8390431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.8390941Z 2025-05-07T20:31:44.8391048Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8391452Z self=, 2025-05-07T20:31:44.8391854Z T=2048, 2025-05-07T20:31:44.8392050Z D=7168, 2025-05-07T20:31:44.8392231Z scale_ub=None, 2025-05-07T20:31:44.8392452Z contiguous=False, 2025-05-07T20:31:44.8392670Z compiled=False, 2025-05-07T20:31:44.8392868Z ) 2025-05-07T20:31:44.8393176Z self = 2025-05-07T20:31:44.8393667Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:44.8393934Z 2025-05-07T20:31:44.8394015Z @given( 2025-05-07T20:31:44.8394237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8394541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8394842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8395163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8395490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8395768Z ) 2025-05-07T20:31:44.8396109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8396546Z def test_silu_mul_quant( 2025-05-07T20:31:44.8396783Z self, 2025-05-07T20:31:44.8396975Z T: int, 2025-05-07T20:31:44.8397162Z D: int, 2025-05-07T20:31:44.8397374Z scale_ub: Optional[float], 2025-05-07T20:31:44.8397638Z contiguous: bool, 2025-05-07T20:31:44.8397873Z compiled: bool, 2025-05-07T20:31:44.8398087Z ) -> None: 2025-05-07T20:31:44.8398297Z torch.manual_seed(2025) 2025-05-07T20:31:44.8398529Z 2025-05-07T20:31:44.8398795Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8400935Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
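The allocator statistics point away from fragmentation, though: nearly the whole 22.07 GiB is live PyTorch allocations, so tensors from earlier failed examples are evidently still referenced. A cleanup called at the top of the test body would run once per Hypothesis example (unittest's tearDown would not, since Hypothesis draws many examples within a single method call); whether this alone recovers the memory here is an assumption:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references from a previous failed example, then return
        # cached blocks to the driver so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()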
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8402785Z 2025-05-07T20:31:44.8402910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:44.8403122Z 2025-05-07T20:31:44.8403230Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8403637Z self=, 2025-05-07T20:31:44.8404037Z T=128, 2025-05-07T20:31:44.8404219Z D=7168, 2025-05-07T20:31:44.8404403Z scale_ub=1200.0, 2025-05-07T20:31:44.8404623Z contiguous=True, 2025-05-07T20:31:44.8404837Z compiled=True, 2025-05-07T20:31:44.8405040Z ) 2025-05-07T20:31:44.8701024Z self = 2025-05-07T20:31:44.8701839Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.8702215Z 2025-05-07T20:31:44.8702323Z @given( 2025-05-07T20:31:44.8702661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8702994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8703304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8703632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8703963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8704245Z ) 2025-05-07T20:31:44.8704592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8705040Z def test_silu_mul_quant( 2025-05-07T20:31:44.8705279Z self, 2025-05-07T20:31:44.8705505Z T: int, 2025-05-07T20:31:44.8705725Z D: int, 2025-05-07T20:31:44.8705947Z scale_ub: Optional[float], 2025-05-07T20:31:44.8706224Z contiguous: bool, 2025-05-07T20:31:44.8706462Z compiled: bool, 2025-05-07T20:31:44.8706681Z ) -> None: 2025-05-07T20:31:44.8706898Z torch.manual_seed(2025) 2025-05-07T20:31:44.8707142Z 2025-05-07T20:31:44.8707414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8707748Z 2025-05-07T20:31:44.8707939Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8708234Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8708537Z x = x_sign * x_clamp 2025-05-07T20:31:44.8708776Z x0 = x[:, :D] 2025-05-07T20:31:44.8708993Z x1 = x[:, D:] 2025-05-07T20:31:44.8709257Z 2025-05-07T20:31:44.8709448Z if contiguous: 2025-05-07T20:31:44.8709679Z x0 = x0.contiguous() 2025-05-07T20:31:44.8709936Z x1 = x1.contiguous() 2025-05-07T20:31:44.8710175Z 2025-05-07T20:31:44.8710364Z if scale_ub is not None: 2025-05-07T20:31:44.8710634Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:44.8710969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:44.8711276Z ) 2025-05-07T20:31:44.8711469Z else: 2025-05-07T20:31:44.8711678Z scale_ub_tensor = None 2025-05-07T20:31:44.8711929Z 2025-05-07T20:31:44.8712159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:44.8712467Z op = silu_mul_quant 2025-05-07T20:31:44.8712718Z if compiled: 2025-05-07T20:31:44.8712968Z op = torch.compile(op) 2025-05-07T20:31:44.8713259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8713532Z 2025-05-07T20:31:44.8713726Z > y_fp8, y_scale = fn() 2025-05-07T20:31:44.8713887Z 2025-05-07T20:31:44.8713988Z moe/activation_test.py:117: 2025-05-07T20:31:44.8714277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8714609Z moe/activation_test.py:115: in fn 2025-05-07T20:31:44.8714888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:44.8715490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:44.8716216Z return fn(*args, **kwargs) 
2025-05-07T20:31:44.8716877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:44.8717552Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:44.8718079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:44.8718751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:44.8719403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:44.8719927Z kernel = self.compile( 2025-05-07T20:31:44.8720463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:44.8721234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:44.8721634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:44.8721866Z 2025-05-07T20:31:44.8722069Z self = 2025-05-07T20:31:44.8723141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:44.8724493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68718c6fc0>} 2025-05-07T20:31:44.8725820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:44.8726835Z context = 2025-05-07T20:31:44.8727125Z 2025-05-07T20:31:44.8727293Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:44.8727806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:44.8728453Z module_map=module_map) 2025-05-07T20:31:44.8728814Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:44.8729169Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:44.8729430Z E ^ 2025-05-07T20:31:44.8729887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:44.8730333Z 2025-05-07T20:31:44.8730745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:44.8731258Z 2025-05-07T20:31:44.8731360Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8731771Z self=, 2025-05-07T20:31:44.8732170Z T=128, 2025-05-07T20:31:44.8732359Z D=7168, 2025-05-07T20:31:44.8732554Z scale_ub=1200.0, 2025-05-07T20:31:44.8732773Z contiguous=True, 2025-05-07T20:31:44.8732995Z compiled=False, 2025-05-07T20:31:44.8733201Z ) 2025-05-07T20:31:44.8733516Z self = 2025-05-07T20:31:44.8734002Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:44.8734275Z 2025-05-07T20:31:44.8734354Z @given( 2025-05-07T20:31:44.8734585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8734893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8735249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8735587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8735908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8736188Z ) 2025-05-07T20:31:44.8736685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8737123Z def test_silu_mul_quant( 2025-05-07T20:31:44.8737363Z self, 2025-05-07T20:31:44.8737557Z T: int, 2025-05-07T20:31:44.8737748Z D: int, 2025-05-07T20:31:44.8737972Z scale_ub: Optional[float], 2025-05-07T20:31:44.8738245Z contiguous: bool, 2025-05-07T20:31:44.8738486Z compiled: bool, 2025-05-07T20:31:44.8738706Z ) -> None: 2025-05-07T20:31:44.8738919Z torch.manual_seed(2025) 2025-05-07T20:31:44.8739162Z 2025-05-07T20:31:44.8739429Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8739771Z 2025-05-07T20:31:44.8739963Z x_sign = torch.sign(x) 2025-05-07T20:31:44.8740373Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:44.8742361Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
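For orientation, the op under test fuses a SiLU-gated multiply with row-wise fp8 quantization. An eager sketch of the same math, written against torch.float8_e4m3fn, is given below; it is a plausible reading of the test's reference path, not FBGEMM's exact scaling convention, and scale_ub is assumed to be an optional one-element float32 tensor as in the test:

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(x0, x1, scale_ub=None):
        # silu(x0) * x1, then quantize each row to e4m3 with a per-row scale.
        y = F.silu(x0.float()) * x1.float()
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_amax = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        y_scale = row_amax.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(1)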
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8744203Z 2025-05-07T20:31:44.8744323Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:31:44.8744535Z 2025-05-07T20:31:44.8744645Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8745052Z self=, 2025-05-07T20:31:44.8745453Z T=128, 2025-05-07T20:31:44.8745645Z D=5120, 2025-05-07T20:31:44.8745868Z scale_ub=1200.0, 2025-05-07T20:31:44.8746105Z contiguous=True, 2025-05-07T20:31:44.8746324Z compiled=True, 2025-05-07T20:31:44.8746527Z ) 2025-05-07T20:31:44.8746842Z self = 2025-05-07T20:31:44.8747321Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:44.8747585Z 2025-05-07T20:31:44.8747673Z @given( 2025-05-07T20:31:44.8747898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:44.8748207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:44.8748508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:44.8748831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:44.8749215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:44.8749499Z ) 2025-05-07T20:31:44.8749844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:44.8750280Z def test_silu_mul_quant( 2025-05-07T20:31:44.8750518Z self, 2025-05-07T20:31:44.8750727Z T: int, 2025-05-07T20:31:44.8750918Z D: int, 2025-05-07T20:31:44.8751137Z scale_ub: Optional[float], 2025-05-07T20:31:44.8751402Z contiguous: bool, 2025-05-07T20:31:44.8751634Z compiled: bool, 2025-05-07T20:31:44.8751857Z ) -> None: 2025-05-07T20:31:44.8752072Z torch.manual_seed(2025) 2025-05-07T20:31:44.8752310Z 2025-05-07T20:31:44.8752575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:44.8752913Z 2025-05-07T20:31:44.8753107Z > x_sign = torch.sign(x) 2025-05-07T20:31:44.8755131Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:44.8757029Z 2025-05-07T20:31:44.8757149Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:31:44.8757362Z 2025-05-07T20:31:44.8757463Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:44.8757872Z self=, 2025-05-07T20:31:44.8758272Z T=128, 2025-05-07T20:31:44.8758453Z D=7168, 2025-05-07T20:31:44.8758643Z scale_ub=None, 2025-05-07T20:31:44.8758856Z contiguous=True, 2025-05-07T20:31:44.8759073Z compiled=True, 2025-05-07T20:31:44.8759274Z ) 2025-05-07T20:31:45.3502168Z self = 2025-05-07T20:31:45.3502868Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.3503446Z 2025-05-07T20:31:45.3503556Z @given( 2025-05-07T20:31:45.3503866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3504186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3504495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3504822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3505144Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3505434Z ) 2025-05-07T20:31:45.3505833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3506271Z def test_silu_mul_quant( 2025-05-07T20:31:45.3506508Z self, 2025-05-07T20:31:45.3506705Z T: int, 2025-05-07T20:31:45.3506907Z D: int, 2025-05-07T20:31:45.3507119Z scale_ub: Optional[float], 2025-05-07T20:31:45.3507393Z contiguous: bool, 2025-05-07T20:31:45.3507632Z compiled: bool, 2025-05-07T20:31:45.3507854Z ) -> None: 2025-05-07T20:31:45.3508069Z torch.manual_seed(2025) 2025-05-07T20:31:45.3508308Z 2025-05-07T20:31:45.3508579Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3510683Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:31:45.3512513Z 2025-05-07T20:31:45.3512634Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:31:45.3512850Z 2025-05-07T20:31:45.3567471Z FAILED 2025-05-07T20:31:45.3567928Z 2025-05-07T20:31:45.3568439Z =================================== FAILURES =================================== 2025-05-07T20:31:45.3569099Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:31:45.3569763Z + Exception Group Traceback (most recent call last): 2025-05-07T20:31:45.3570646Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:31:45.3571425Z | yield 2025-05-07T20:31:45.3572059Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:31:45.3572786Z | self._callTestMethod(testMethod) 2025-05-07T20:31:45.3573573Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:31:45.3574352Z | if method() is not None: 2025-05-07T20:31:45.3574699Z | ^^^^^^^^ 2025-05-07T20:31:45.3575630Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:31:45.3576664Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3577358Z | ^^^^^^^ 2025-05-07T20:31:45.3578164Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:31:45.3579049Z | raise the_error_hypothesis_found 2025-05-07T20:31:45.3579646Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:31:45.3580249Z +-+---------------- 1 ---------------- 2025-05-07T20:31:45.3580657Z | Traceback (most recent call last): 2025-05-07T20:31:45.3581675Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:31:45.3582781Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3583503Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:45.3586387Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3590641Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3591093Z | self=,
2025-05-07T20:31:45.3591504Z | T=128,
2025-05-07T20:31:45.3591708Z | D=7168,
2025-05-07T20:31:45.3591941Z | scale_ub=1200.0,
2025-05-07T20:31:45.3592192Z | contiguous=True,
2025-05-07T20:31:45.3592431Z | compiled=False,
2025-05-07T20:31:45.3592671Z | )
2025-05-07T20:31:45.3592864Z |
2025-05-07T20:31:45.3593752Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
2025-05-07T20:31:45.3594353Z +---------------- 2 ----------------
2025-05-07T20:31:45.3594652Z | Traceback (most recent call last):
2025-05-07T20:31:45.3595362Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:31:45.3596127Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3596510Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3598500Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3600470Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3600912Z | self=,
2025-05-07T20:31:45.3601321Z | T=128,
2025-05-07T20:31:45.3601532Z | D=7168,
2025-05-07T20:31:45.3601753Z | scale_ub=None,
2025-05-07T20:31:45.3601992Z | contiguous=True,
2025-05-07T20:31:45.3602245Z | compiled=True,
2025-05-07T20:31:45.3602498Z | )
2025-05-07T20:31:45.3602764Z |
2025-05-07T20:31:45.3603402Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3604018Z +---------------- 3 ----------------
2025-05-07T20:31:45.3604440Z | Traceback (most recent call last):
2025-05-07T20:31:45.3605157Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
2025-05-07T20:31:45.3606029Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3606409Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3622078Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.3624443Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3625085Z | self=,
2025-05-07T20:31:45.3625729Z | T=128,
2025-05-07T20:31:45.3626030Z | D=5120,
2025-05-07T20:31:45.3626329Z | scale_ub=1200.0,
2025-05-07T20:31:45.3626685Z | contiguous=True,
2025-05-07T20:31:45.3627038Z | compiled=True,
2025-05-07T20:31:45.3627359Z | )
2025-05-07T20:31:45.3627622Z |
2025-05-07T20:31:45.3628658Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3629647Z +---------------- 4 ----------------
2025-05-07T20:31:45.3630056Z | Traceback (most recent call last):
2025-05-07T20:31:45.3631080Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
2025-05-07T20:31:45.3632105Z | y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.3632508Z | ^^^^^^^^
2025-05-07T20:31:45.3633423Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
2025-05-07T20:31:45.3634427Z | return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3634907Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3636091Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
2025-05-07T20:31:45.3637230Z | _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.3638089Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda>
2025-05-07T20:31:45.3639137Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.3639669Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3640335Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run
2025-05-07T20:31:45.3641140Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3641639Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3642324Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in <dictcomp>
2025-05-07T20:31:45.3643161Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3643662Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3644782Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench
2025-05-07T20:31:45.3645797Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.3646359Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3647231Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench
2025-05-07T20:31:45.3648044Z | fn()
2025-05-07T20:31:45.3648874Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
2025-05-07T20:31:45.3650040Z | self.fn.run(
2025-05-07T20:31:45.3650832Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run
2025-05-07T20:31:45.3651487Z | kernel = self.compile(
2025-05-07T20:31:45.3651811Z | ^^^^^^^^^^^^^
2025-05-07T20:31:45.3652665Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile
2025-05-07T20:31:45.3653701Z | module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.3654278Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3655234Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:31:45.3656378Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.3657032Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:31:45.3657565Z | triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.3658056Z | def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.3658431Z | ^
2025-05-07T20:31:45.3659087Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.3659904Z | Falsifying example: test_silu_mul_quant(
2025-05-07T20:31:45.3660477Z | # The test always failed when commented parts were varied together.
2025-05-07T20:31:45.3661207Z | self=,
2025-05-07T20:31:45.3661836Z | T=1,  # or any other generated value
2025-05-07T20:31:45.3662286Z | D=5120,  # or any other generated value
2025-05-07T20:31:45.3662765Z | scale_ub=None,  # or any other generated value
2025-05-07T20:31:45.3663288Z | contiguous=True,  # or any other generated value
2025-05-07T20:31:45.3663814Z | compiled=True,  # or any other generated value
2025-05-07T20:31:45.3664248Z | )
2025-05-07T20:31:45.3664507Z |
2025-05-07T20:31:45.3665263Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:31:45.3666124Z +------------------------------------
2025-05-07T20:31:45.3666603Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:31:45.3667103Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.3667667Z     self=,
2025-05-07T20:31:45.3668177Z     T=1,
2025-05-07T20:31:45.3668428Z     D=5120,
2025-05-07T20:31:45.3668685Z     scale_ub=None,
2025-05-07T20:31:45.3668982Z     contiguous=True,
2025-05-07T20:31:45.3669414Z     compiled=True,
2025-05-07T20:31:45.3669720Z )
2025-05-07T20:31:45.3670203Z self = 
2025-05-07T20:31:45.3670884Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:45.3671261Z 
2025-05-07T20:31:45.3671381Z     @given(
2025-05-07T20:31:45.3671845Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.3672267Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.3672691Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.3673136Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.3673568Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.3673957Z     )
2025-05-07T20:31:45.3674424Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.3675017Z     def test_silu_mul_quant(
2025-05-07T20:31:45.3675337Z         self,
2025-05-07T20:31:45.3675605Z         T: int,
2025-05-07T20:31:45.3675883Z         D: int,
2025-05-07T20:31:45.3676169Z         scale_ub: Optional[float],
2025-05-07T20:31:45.3676637Z         contiguous: bool,
2025-05-07T20:31:45.3676968Z         compiled: bool,
2025-05-07T20:31:45.3677261Z     ) -> None:
2025-05-07T20:31:45.3677560Z         torch.manual_seed(2025)
2025-05-07T20:31:45.3677891Z 
2025-05-07T20:31:45.3678256Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.3678723Z 
2025-05-07T20:31:45.3678995Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.3679383Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.3679795Z         x = x_sign * x_clamp
2025-05-07T20:31:45.3680122Z         x0 = x[:, :D]
2025-05-07T20:31:45.3680439Z         x1 = x[:, D:]
2025-05-07T20:31:45.3680726Z 
2025-05-07T20:31:45.3681000Z         if contiguous:
2025-05-07T20:31:45.3681316Z             x0 = x0.contiguous()
2025-05-07T20:31:45.3681670Z             x1 = x1.contiguous()
2025-05-07T20:31:45.3681990Z 
2025-05-07T20:31:45.3682250Z         if scale_ub is not None:
2025-05-07T20:31:45.3682607Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.3683056Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.3683465Z             )
2025-05-07T20:31:45.3683723Z         else:
2025-05-07T20:31:45.3683998Z             scale_ub_tensor = None
2025-05-07T20:31:45.3684339Z 
2025-05-07T20:31:45.3684643Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.3685055Z             op = silu_mul_quant
2025-05-07T20:31:45.3685391Z             if compiled:
2025-05-07T20:31:45.3685724Z                 op = torch.compile(op)
2025-05-07T20:31:45.3686109Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.3686475Z 
2025-05-07T20:31:45.3686739Z         y_fp8, y_scale = fn()
2025-05-07T20:31:45.3687110Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:45.3687501Z 
2025-05-07T20:31:45.3687820Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.3688255Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:45.3688654Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:45.3689087Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:45.3689566Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3689973Z 
2025-05-07T20:31:45.3690261Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:45.3690535Z 
2025-05-07T20:31:45.3690683Z moe/activation_test.py:126: 
2025-05-07T20:31:45.3691095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.3691581Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:45.3692051Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:45.3693158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:45.3694220Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:45.3694989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.3695956Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.3697014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:45.3698022Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3699086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:31:45.3700157Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:45.3701175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:45.3702088Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:45.3702932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:45.3703766Z     fn()
2025-05-07T20:31:45.3704469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:45.3705294Z     self.fn.run(
2025-05-07T20:31:45.3705955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.3706679Z     kernel = self.compile(
2025-05-07T20:31:45.3707404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.3708244Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.3708791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.3709194Z 
2025-05-07T20:31:45.3709455Z self = 
2025-05-07T20:31:45.3710892Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.3712815Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9943240>}
2025-05-07T20:31:45.3714663Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.3716112Z context = 
2025-05-07T20:31:45.3716501Z 
2025-05-07T20:31:45.3716722Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.3717423Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.3718071Z             module_map=module_map)
2025-05-07T20:31:45.3718562Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.3719051Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:45.3719435Z E       ^
2025-05-07T20:31:45.3720097Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3720732Z 2025-05-07T20:31:45.3721315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3722043Z 2025-05-07T20:31:45.3722191Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3722773Z self=, 2025-05-07T20:31:45.3723330Z T=2048, 2025-05-07T20:31:45.3723590Z D=5120, 2025-05-07T20:31:45.3723877Z scale_ub=1200.0, 2025-05-07T20:31:45.3724196Z contiguous=True, 2025-05-07T20:31:45.3724515Z compiled=False, 2025-05-07T20:31:45.3724815Z ) 2025-05-07T20:31:45.3725271Z self = 2025-05-07T20:31:45.3725956Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.3726704Z 2025-05-07T20:31:45.3726816Z @given( 2025-05-07T20:31:45.3727139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3727566Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3727995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3728727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3729202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3729598Z ) 2025-05-07T20:31:45.3730047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3730625Z def test_silu_mul_quant( 2025-05-07T20:31:45.3730953Z self, 2025-05-07T20:31:45.3731201Z T: int, 2025-05-07T20:31:45.3731661Z D: int, 2025-05-07T20:31:45.3731940Z scale_ub: Optional[float], 2025-05-07T20:31:45.3732311Z contiguous: bool, 2025-05-07T20:31:45.3732644Z compiled: bool, 2025-05-07T20:31:45.3732927Z ) -> None: 2025-05-07T20:31:45.3733208Z torch.manual_seed(2025) 2025-05-07T20:31:45.3733514Z 2025-05-07T20:31:45.3733846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3734282Z 2025-05-07T20:31:45.3734539Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3734921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3735391Z x = x_sign * x_clamp 2025-05-07T20:31:45.3735721Z x0 = x[:, :D] 2025-05-07T20:31:45.3736025Z x1 = x[:, D:] 2025-05-07T20:31:45.3736279Z 2025-05-07T20:31:45.3736515Z if contiguous: 2025-05-07T20:31:45.3736808Z x0 = x0.contiguous() 2025-05-07T20:31:45.3737124Z x1 = x1.contiguous() 2025-05-07T20:31:45.3737433Z 2025-05-07T20:31:45.3737670Z if scale_ub is not None: 2025-05-07T20:31:45.3738038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3738494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3738938Z ) 2025-05-07T20:31:45.3739205Z else: 2025-05-07T20:31:45.3739499Z scale_ub_tensor = None 2025-05-07T20:31:45.3739840Z 2025-05-07T20:31:45.3740153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3740597Z op = silu_mul_quant 2025-05-07T20:31:45.3740959Z if compiled: 2025-05-07T20:31:45.3741307Z op = torch.compile(op) 2025-05-07T20:31:45.3741693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3742043Z 2025-05-07T20:31:45.3742284Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3742519Z 2025-05-07T20:31:45.3742655Z moe/activation_test.py:117: 2025-05-07T20:31:45.3743059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3743523Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3743913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3744900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3745914Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3746632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3747497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3748400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3749167Z kernel = self.compile( 2025-05-07T20:31:45.3749871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3750714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3751225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3751514Z 2025-05-07T20:31:45.3751968Z self = 2025-05-07T20:31:45.3753520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3755479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a995ade0>} 2025-05-07T20:31:45.3757249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3758710Z context = 2025-05-07T20:31:45.3759107Z 2025-05-07T20:31:45.3759349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3760066Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3760727Z module_map=module_map) 2025-05-07T20:31:45.3761230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3761706Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3762076Z E ^ 2025-05-07T20:31:45.3762744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3763387Z 2025-05-07T20:31:45.3763987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3764718Z 2025-05-07T20:31:45.3764876Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3765510Z self=, 2025-05-07T20:31:45.3766081Z T=2048, 2025-05-07T20:31:45.3766339Z D=5120, 2025-05-07T20:31:45.3766611Z scale_ub=1200.0, 2025-05-07T20:31:45.3766932Z contiguous=True, 2025-05-07T20:31:45.3767239Z compiled=True, 2025-05-07T20:31:45.3767532Z ) 2025-05-07T20:31:45.3767982Z self = 2025-05-07T20:31:45.3768656Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.3769040Z 2025-05-07T20:31:45.3769151Z @given( 2025-05-07T20:31:45.3769476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3769917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3770344Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3770810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3771281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3771678Z ) 2025-05-07T20:31:45.3772163Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3772775Z def test_silu_mul_quant( 2025-05-07T20:31:45.3773119Z self, 2025-05-07T20:31:45.3773396Z T: int, 2025-05-07T20:31:45.3773678Z D: int, 2025-05-07T20:31:45.3773967Z scale_ub: Optional[float], 2025-05-07T20:31:45.3774358Z contiguous: bool, 2025-05-07T20:31:45.3774710Z compiled: bool, 2025-05-07T20:31:45.3775030Z ) -> None: 2025-05-07T20:31:45.3775343Z torch.manual_seed(2025) 2025-05-07T20:31:45.3775707Z 2025-05-07T20:31:45.3776093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3776560Z 2025-05-07T20:31:45.3776830Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3777233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3777660Z x = x_sign * x_clamp 2025-05-07T20:31:45.3778001Z x0 = x[:, :D] 2025-05-07T20:31:45.3778303Z x1 = x[:, D:] 2025-05-07T20:31:45.3778595Z 2025-05-07T20:31:45.3778856Z if contiguous: 2025-05-07T20:31:45.3779184Z x0 = x0.contiguous() 2025-05-07T20:31:45.3779645Z x1 = x1.contiguous() 2025-05-07T20:31:45.3779988Z 2025-05-07T20:31:45.3780267Z if scale_ub is not None: 2025-05-07T20:31:45.3780651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3781118Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3781558Z ) 2025-05-07T20:31:45.3781834Z else: 2025-05-07T20:31:45.3782123Z scale_ub_tensor = None 2025-05-07T20:31:45.3782477Z 2025-05-07T20:31:45.3782801Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3783234Z op = silu_mul_quant 2025-05-07T20:31:45.3783592Z if compiled: 2025-05-07T20:31:45.3783940Z op = torch.compile(op) 2025-05-07T20:31:45.3784436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3784822Z 2025-05-07T20:31:45.3785096Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.3785487Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.3785905Z 2025-05-07T20:31:45.3786239Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3786711Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.3787114Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.3787554Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.3788056Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3788487Z 2025-05-07T20:31:45.3788772Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.3789146Z 2025-05-07T20:31:45.3789301Z moe/activation_test.py:126: 2025-05-07T20:31:45.3789732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3790237Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.3790711Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3791818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.3792874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.3793644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3794613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3795575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.3796598Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3797655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.3798709Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3799736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.3800633Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.3801437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.3802112Z fn() 2025-05-07T20:31:45.3802787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.3803565Z self.fn.run( 2025-05-07T20:31:45.3804154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3804833Z kernel = self.compile( 2025-05-07T20:31:45.3805590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3806452Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3807802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3808123Z 2025-05-07T20:31:45.3808392Z self = 2025-05-07T20:31:45.3809861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3811835Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a9ace700>} 2025-05-07T20:31:45.3813740Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3815388Z context = 2025-05-07T20:31:45.3815793Z 2025-05-07T20:31:45.3816036Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3816767Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3817419Z module_map=module_map) 2025-05-07T20:31:45.3817904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3818388Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.3818743Z E ^ 2025-05-07T20:31:45.3819358Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3819975Z 2025-05-07T20:31:45.3820546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3821274Z 2025-05-07T20:31:45.3821419Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3821983Z self=, 2025-05-07T20:31:45.3822538Z T=16384, 2025-05-07T20:31:45.3822815Z D=7168, 2025-05-07T20:31:45.3823085Z scale_ub=1200.0, 2025-05-07T20:31:45.3823394Z contiguous=False, 2025-05-07T20:31:45.3823711Z compiled=False, 2025-05-07T20:31:45.3824006Z ) 2025-05-07T20:31:45.3824464Z self = 2025-05-07T20:31:45.3849541Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.3849939Z 2025-05-07T20:31:45.3850049Z @given( 2025-05-07T20:31:45.3850360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3850789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3851202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3851670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3852117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3852515Z ) 2025-05-07T20:31:45.3853001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3853601Z def test_silu_mul_quant( 2025-05-07T20:31:45.3853933Z self, 2025-05-07T20:31:45.3854199Z T: int, 2025-05-07T20:31:45.3854465Z D: int, 2025-05-07T20:31:45.3854766Z scale_ub: Optional[float], 2025-05-07T20:31:45.3855138Z contiguous: bool, 2025-05-07T20:31:45.3855460Z compiled: bool, 2025-05-07T20:31:45.3855773Z ) -> None: 2025-05-07T20:31:45.3856076Z torch.manual_seed(2025) 2025-05-07T20:31:45.3856404Z 2025-05-07T20:31:45.3856782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3857245Z 2025-05-07T20:31:45.3857505Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3857909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3858342Z x = x_sign * x_clamp 2025-05-07T20:31:45.3858668Z x0 = x[:, :D] 2025-05-07T20:31:45.3858947Z x1 = x[:, D:] 2025-05-07T20:31:45.3859209Z 2025-05-07T20:31:45.3859746Z if contiguous: 2025-05-07T20:31:45.3860055Z x0 = x0.contiguous() 2025-05-07T20:31:45.3860396Z x1 = x1.contiguous() 2025-05-07T20:31:45.3860709Z 2025-05-07T20:31:45.3860955Z if scale_ub is not None: 2025-05-07T20:31:45.3861323Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3861765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3862169Z ) 2025-05-07T20:31:45.3862416Z else: 2025-05-07T20:31:45.3862685Z scale_ub_tensor = None 2025-05-07T20:31:45.3863007Z 2025-05-07T20:31:45.3863322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3863748Z op = silu_mul_quant 2025-05-07T20:31:45.3864313Z if compiled: 
2025-05-07T20:31:45.3864649Z op = torch.compile(op) 2025-05-07T20:31:45.3865052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3865480Z 2025-05-07T20:31:45.3865753Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3865979Z 2025-05-07T20:31:45.3866112Z moe/activation_test.py:117: 2025-05-07T20:31:45.3866503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3866946Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3867316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3868250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3869302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3870045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3871006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3871920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3872655Z kernel = self.compile( 2025-05-07T20:31:45.3873419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3874340Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3874908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3875230Z 2025-05-07T20:31:45.3875513Z self = 2025-05-07T20:31:45.3877032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3878988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68aabc1760>} 2025-05-07T20:31:45.3880887Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3882311Z context = 2025-05-07T20:31:45.3882721Z 2025-05-07T20:31:45.3882946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3883665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3884310Z module_map=module_map) 2025-05-07T20:31:45.3884798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3885308Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3885674Z E ^ 2025-05-07T20:31:45.3886307Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3886952Z 2025-05-07T20:31:45.3887619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3888302Z 2025-05-07T20:31:45.3888442Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3889022Z self=, 2025-05-07T20:31:45.3889572Z T=1, 2025-05-07T20:31:45.3889836Z D=7168, 2025-05-07T20:31:45.3890119Z scale_ub=None, 2025-05-07T20:31:45.3890404Z contiguous=True, 2025-05-07T20:31:45.3890696Z compiled=True, 2025-05-07T20:31:45.3890952Z ) 2025-05-07T20:31:45.3891362Z self = 2025-05-07T20:31:45.3891989Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.3892412Z 2025-05-07T20:31:45.3892509Z @given( 2025-05-07T20:31:45.3892799Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3893180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3893571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3893993Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3894442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3894841Z ) 2025-05-07T20:31:45.3895338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3895953Z def test_silu_mul_quant( 2025-05-07T20:31:45.3896296Z self, 2025-05-07T20:31:45.3896577Z T: int, 2025-05-07T20:31:45.3896854Z D: int, 2025-05-07T20:31:45.3897157Z scale_ub: Optional[float], 2025-05-07T20:31:45.3897541Z contiguous: bool, 2025-05-07T20:31:45.3897873Z compiled: bool, 2025-05-07T20:31:45.3898182Z ) -> None: 2025-05-07T20:31:45.3898485Z torch.manual_seed(2025) 2025-05-07T20:31:45.3898806Z 2025-05-07T20:31:45.3899140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3899608Z 2025-05-07T20:31:45.3899888Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3900275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3900702Z x = x_sign * x_clamp 2025-05-07T20:31:45.3901028Z x0 = x[:, :D] 2025-05-07T20:31:45.3901316Z x1 = x[:, D:] 2025-05-07T20:31:45.3901605Z 2025-05-07T20:31:45.3901862Z if contiguous: 2025-05-07T20:31:45.3902173Z x0 = x0.contiguous() 2025-05-07T20:31:45.3902529Z x1 = x1.contiguous() 2025-05-07T20:31:45.3902858Z 2025-05-07T20:31:45.3903118Z if scale_ub is not None: 2025-05-07T20:31:45.3903489Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3903951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3904394Z ) 2025-05-07T20:31:45.3904666Z else: 2025-05-07T20:31:45.3904968Z scale_ub_tensor = None 2025-05-07T20:31:45.3905326Z 2025-05-07T20:31:45.3905650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3906096Z op = silu_mul_quant 2025-05-07T20:31:45.3906448Z if compiled: 2025-05-07T20:31:45.3906789Z op = torch.compile(op) 2025-05-07T20:31:45.3907187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3907534Z 2025-05-07T20:31:45.3907731Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.3908012Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.3908306Z 2025-05-07T20:31:45.3908546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3908874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.3909264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.3909587Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.3909937Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3910250Z 2025-05-07T20:31:45.3910454Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.3910754Z 2025-05-07T20:31:45.3910865Z moe/activation_test.py:126: 2025-05-07T20:31:45.3911156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3911493Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.3911816Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.3912602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.3913353Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.3913895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3914654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3915338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.3916069Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3916831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.3917582Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.3918308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.3918951Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.3919553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.3920071Z fn() 2025-05-07T20:31:45.3920580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.3921163Z self.fn.run( 2025-05-07T20:31:45.3921635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3922162Z kernel = self.compile( 2025-05-07T20:31:45.3922702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3923355Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3923748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3923985Z 2025-05-07T20:31:45.3924189Z self = 2025-05-07T20:31:45.3925270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3926665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68aa04cd60>} 2025-05-07T20:31:45.3928013Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3929346Z context = 2025-05-07T20:31:45.3929639Z 2025-05-07T20:31:45.3929804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3930322Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3930790Z module_map=module_map) 2025-05-07T20:31:45.3931155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3931512Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.3931779Z E ^ 2025-05-07T20:31:45.3932419Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3932878Z 2025-05-07T20:31:45.3933294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3933806Z 2025-05-07T20:31:45.3933913Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3934323Z self=, 2025-05-07T20:31:45.3934719Z T=4096, 2025-05-07T20:31:45.3934915Z D=5120, 2025-05-07T20:31:45.3935117Z scale_ub=None, 2025-05-07T20:31:45.3935335Z contiguous=False, 2025-05-07T20:31:45.3935566Z compiled=False, 2025-05-07T20:31:45.3935777Z ) 2025-05-07T20:31:45.3936093Z self = 2025-05-07T20:31:45.3936716Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.3936983Z 2025-05-07T20:31:45.3937067Z @given( 2025-05-07T20:31:45.3937298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3937611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3937916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3938243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3938561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3938843Z ) 2025-05-07T20:31:45.3939189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3939639Z def test_silu_mul_quant( 2025-05-07T20:31:45.3939875Z self, 2025-05-07T20:31:45.3940072Z T: int, 2025-05-07T20:31:45.3940274Z D: int, 2025-05-07T20:31:45.3940494Z scale_ub: Optional[float], 2025-05-07T20:31:45.3940773Z contiguous: bool, 2025-05-07T20:31:45.3941015Z compiled: bool, 2025-05-07T20:31:45.3941231Z ) -> None: 2025-05-07T20:31:45.3941448Z torch.manual_seed(2025) 2025-05-07T20:31:45.3941692Z 2025-05-07T20:31:45.3941966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3942306Z 2025-05-07T20:31:45.3942504Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3942790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3943101Z x = x_sign * x_clamp 2025-05-07T20:31:45.3943342Z x0 = x[:, :D] 2025-05-07T20:31:45.3943561Z x1 = x[:, D:] 2025-05-07T20:31:45.3943762Z 2025-05-07T20:31:45.3944318Z if contiguous: 2025-05-07T20:31:45.3944609Z x0 = x0.contiguous() 2025-05-07T20:31:45.3944961Z x1 = x1.contiguous() 2025-05-07T20:31:45.3945370Z 2025-05-07T20:31:45.3945624Z if scale_ub is not None: 2025-05-07T20:31:45.3946023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3946511Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3946873Z ) 2025-05-07T20:31:45.3947217Z else: 2025-05-07T20:31:45.3947587Z scale_ub_tensor = None 2025-05-07T20:31:45.3947925Z 2025-05-07T20:31:45.3948238Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3948695Z op = silu_mul_quant 2025-05-07T20:31:45.3949039Z if compiled: 
2025-05-07T20:31:45.3949428Z op = torch.compile(op) 2025-05-07T20:31:45.3949874Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3950258Z 2025-05-07T20:31:45.3950505Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3950780Z 2025-05-07T20:31:45.3950925Z moe/activation_test.py:117: 2025-05-07T20:31:45.3951352Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3951755Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3952162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3952961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3953811Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3954474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3955264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3955995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3956698Z kernel = self.compile( 2025-05-07T20:31:45.3957291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3958017Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3958699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3958956Z 2025-05-07T20:31:45.3959224Z self = 2025-05-07T20:31:45.3960344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.3961887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d402c0>} 2025-05-07T20:31:45.3963311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.3964475Z context = 2025-05-07T20:31:45.3964810Z 2025-05-07T20:31:45.3965039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.3965615Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.3966265Z module_map=module_map) 2025-05-07T20:31:45.3966733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.3967143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.3967554Z E ^ 2025-05-07T20:31:45.3968150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.3968624Z 2025-05-07T20:31:45.3969131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.3969654Z 2025-05-07T20:31:45.3969843Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.3970361Z self=, 2025-05-07T20:31:45.3970878Z T=4096, 2025-05-07T20:31:45.3971213Z D=7168, 2025-05-07T20:31:45.3971463Z scale_ub=None, 2025-05-07T20:31:45.3971789Z contiguous=False, 2025-05-07T20:31:45.3972162Z compiled=False, 2025-05-07T20:31:45.3972425Z ) 2025-05-07T20:31:45.3972852Z self = 2025-05-07T20:31:45.3973510Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.3973811Z 2025-05-07T20:31:45.3973974Z @given( 2025-05-07T20:31:45.3974257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.3974712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.3975125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.3975507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.3976006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.3976402Z ) 2025-05-07T20:31:45.3976809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.3977386Z def test_silu_mul_quant( 2025-05-07T20:31:45.3977739Z self, 2025-05-07T20:31:45.3978007Z T: int, 2025-05-07T20:31:45.3978459Z D: int, 2025-05-07T20:31:45.3978767Z scale_ub: Optional[float], 2025-05-07T20:31:45.3979109Z contiguous: bool, 2025-05-07T20:31:45.3979491Z compiled: bool, 2025-05-07T20:31:45.3979802Z ) -> None: 2025-05-07T20:31:45.3980095Z torch.manual_seed(2025) 2025-05-07T20:31:45.3980472Z 2025-05-07T20:31:45.3980831Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.3981262Z 2025-05-07T20:31:45.3981573Z x_sign = torch.sign(x) 2025-05-07T20:31:45.3981949Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.3982391Z x = x_sign * x_clamp 2025-05-07T20:31:45.3982782Z x0 = x[:, :D] 2025-05-07T20:31:45.3983136Z x1 = x[:, D:] 2025-05-07T20:31:45.3983440Z 2025-05-07T20:31:45.3983782Z if contiguous: 2025-05-07T20:31:45.3984090Z x0 = x0.contiguous() 2025-05-07T20:31:45.3984419Z x1 = x1.contiguous() 2025-05-07T20:31:45.3984812Z 2025-05-07T20:31:45.3985088Z if scale_ub is not None: 2025-05-07T20:31:45.3985430Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.3985926Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.3986305Z ) 2025-05-07T20:31:45.3986607Z else: 2025-05-07T20:31:45.3986972Z scale_ub_tensor = None 2025-05-07T20:31:45.3987327Z 2025-05-07T20:31:45.3987595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.3988085Z op = silu_mul_quant 2025-05-07T20:31:45.3988421Z if compiled: 2025-05-07T20:31:45.3988708Z op = torch.compile(op) 2025-05-07T20:31:45.3989251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3989624Z 2025-05-07T20:31:45.3989862Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.3990183Z 2025-05-07T20:31:45.3990312Z moe/activation_test.py:117: 2025-05-07T20:31:45.3990725Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3991185Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.3991554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.3992327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.3993140Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.3993859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.3994764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.3995751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.3996488Z kernel = self.compile( 2025-05-07T20:31:45.3997176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.3998102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.3998599Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.3998882Z 2025-05-07T20:31:45.3999150Z self = 2025-05-07T20:31:45.4000335Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4001802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a8d42160>} 2025-05-07T20:31:45.4003254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4004505Z context = 2025-05-07T20:31:45.4004825Z 2025-05-07T20:31:45.4005067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4005659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4006270Z module_map=module_map) 2025-05-07T20:31:45.4006747Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4007175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4007552Z E ^ 2025-05-07T20:31:45.4008153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4008707Z 2025-05-07T20:31:45.4009175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4009771Z 2025-05-07T20:31:45.4009984Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4010447Z self=, 2025-05-07T20:31:45.4010917Z T=128, 2025-05-07T20:31:45.4011285Z D=7168, 2025-05-07T20:31:45.4011536Z scale_ub=None, 2025-05-07T20:31:45.4011825Z contiguous=False, 2025-05-07T20:31:45.4012226Z compiled=True, 2025-05-07T20:31:45.4012486Z ) 2025-05-07T20:31:45.4012885Z self = 2025-05-07T20:31:45.4013938Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4014231Z 2025-05-07T20:31:45.4014371Z @given( 2025-05-07T20:31:45.4014645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4015151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4015590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4015963Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4016472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4016841Z ) 2025-05-07T20:31:45.4017226Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4017838Z def test_silu_mul_quant( 2025-05-07T20:31:45.4018164Z self, 2025-05-07T20:31:45.4018504Z T: int, 2025-05-07T20:31:45.4018800Z D: int, 2025-05-07T20:31:45.4019103Z scale_ub: Optional[float], 2025-05-07T20:31:45.4019513Z contiguous: bool, 2025-05-07T20:31:45.4019820Z compiled: bool, 2025-05-07T20:31:45.4020128Z ) -> None: 2025-05-07T20:31:45.4020485Z torch.manual_seed(2025) 2025-05-07T20:31:45.4020790Z 2025-05-07T20:31:45.4021169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4021643Z 2025-05-07T20:31:45.4021902Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4022301Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4022733Z x = x_sign * x_clamp 2025-05-07T20:31:45.4023098Z x0 = x[:, :D] 2025-05-07T20:31:45.4023388Z x1 = x[:, D:] 2025-05-07T20:31:45.4023719Z 2025-05-07T20:31:45.4024005Z if contiguous: 2025-05-07T20:31:45.4024308Z x0 = x0.contiguous() 2025-05-07T20:31:45.4024696Z x1 = x1.contiguous() 2025-05-07T20:31:45.4025052Z 2025-05-07T20:31:45.4025295Z if scale_ub is not None: 2025-05-07T20:31:45.4025704Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4026160Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4026517Z ) 2025-05-07T20:31:45.4026851Z else: 2025-05-07T20:31:45.4027185Z scale_ub_tensor = None 2025-05-07T20:31:45.4027518Z 2025-05-07T20:31:45.4035401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4035776Z op = silu_mul_quant 2025-05-07T20:31:45.4036043Z if compiled: 2025-05-07T20:31:45.4036294Z op = torch.compile(op) 2025-05-07T20:31:45.4036846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4037131Z 2025-05-07T20:31:45.4037325Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4037617Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4037913Z 2025-05-07T20:31:45.4038151Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4038488Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4038785Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4039107Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4039461Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4039772Z 2025-05-07T20:31:45.4040136Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4040332Z 2025-05-07T20:31:45.4040435Z moe/activation_test.py:126: 2025-05-07T20:31:45.4040734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4041084Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4041409Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4042205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4042966Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4043521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4044200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4044889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4045621Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4046385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4047127Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4047865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4048507Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4049113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4049631Z fn() 2025-05-07T20:31:45.4050148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4050740Z self.fn.run( 2025-05-07T20:31:45.4051212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4051751Z kernel = self.compile( 2025-05-07T20:31:45.4052304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4052968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4053369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4053610Z 2025-05-07T20:31:45.4053818Z self = 2025-05-07T20:31:45.4054914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4056313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a8d437e0>} 2025-05-07T20:31:45.4057759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4058798Z context = 2025-05-07T20:31:45.4059094Z 2025-05-07T20:31:45.4059262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4059791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4060256Z module_map=module_map) 2025-05-07T20:31:45.4060628Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4060993Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4061255Z E ^ 2025-05-07T20:31:45.4061802Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4062265Z 2025-05-07T20:31:45.4062693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4063206Z 2025-05-07T20:31:45.4063320Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4063732Z self=, 2025-05-07T20:31:45.4064139Z T=128, 2025-05-07T20:31:45.4064334Z D=7168, 2025-05-07T20:31:45.4064524Z scale_ub=None, 2025-05-07T20:31:45.4064747Z contiguous=False, 2025-05-07T20:31:45.4064979Z compiled=False, 2025-05-07T20:31:45.4065187Z ) 2025-05-07T20:31:45.4065561Z self = 2025-05-07T20:31:45.4066059Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4066328Z 2025-05-07T20:31:45.4066420Z @given( 2025-05-07T20:31:45.4066653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4066973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4067281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4067614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4067946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4068235Z ) 2025-05-07T20:31:45.4068580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4069026Z def test_silu_mul_quant( 2025-05-07T20:31:45.4069349Z self, 2025-05-07T20:31:45.4069550Z T: int, 2025-05-07T20:31:45.4069741Z D: int, 2025-05-07T20:31:45.4069967Z scale_ub: Optional[float], 2025-05-07T20:31:45.4070240Z contiguous: bool, 2025-05-07T20:31:45.4070476Z compiled: bool, 2025-05-07T20:31:45.4070703Z ) -> None: 2025-05-07T20:31:45.4070924Z torch.manual_seed(2025) 2025-05-07T20:31:45.4071166Z 2025-05-07T20:31:45.4071445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4071794Z 2025-05-07T20:31:45.4071985Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4072284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4072597Z x = x_sign * x_clamp 2025-05-07T20:31:45.4072834Z x0 = x[:, :D] 2025-05-07T20:31:45.4073054Z x1 = x[:, D:] 2025-05-07T20:31:45.4073270Z 2025-05-07T20:31:45.4073454Z if contiguous: 2025-05-07T20:31:45.4073689Z x0 = x0.contiguous() 2025-05-07T20:31:45.4073953Z x1 = x1.contiguous() 2025-05-07T20:31:45.4074184Z 2025-05-07T20:31:45.4074380Z if scale_ub is not None: 2025-05-07T20:31:45.4074654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4074991Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4075070Z ) 2025-05-07T20:31:45.4075151Z else: 2025-05-07T20:31:45.4075256Z scale_ub_tensor = None 2025-05-07T20:31:45.4075328Z 2025-05-07T20:31:45.4075458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4075556Z op = silu_mul_quant 2025-05-07T20:31:45.4075730Z if compiled: 
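Every failure in this job has the same root cause. Triton's fp8e4nv type is the E4M3 float8 format (torch.float8_e4m3fn), which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada, Hopper). This job runs on a g5.4xlarge runner, whose A10G GPU is compute capability 8.6, so any kernel that quantizes to E4M3 fails at compile time and Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal capability guard might look like the sketch below (the helper name is illustrative, not from the FBGEMM sources):

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv (E4M3) only on sm_89+ NVIDIA GPUs.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)
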
2025-05-07T20:31:45.4075838Z op = torch.compile(op) 2025-05-07T20:31:45.4075952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4076026Z 2025-05-07T20:31:45.4076117Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4076122Z 2025-05-07T20:31:45.4076228Z moe/activation_test.py:117: 2025-05-07T20:31:45.4076357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4076467Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4076569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4077072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4077252Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4077611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4077843Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4078191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4078288Z kernel = self.compile( 2025-05-07T20:31:45.4078677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4078850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4078978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4078983Z 2025-05-07T20:31:45.4079198Z self = 2025-05-07T20:31:45.4079990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4080509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894be0860>} 2025-05-07T20:31:45.4081265Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4081462Z context = 2025-05-07T20:31:45.4081466Z 2025-05-07T20:31:45.4081635Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4081895Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4082021Z module_map=module_map) 2025-05-07T20:31:45.4082183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4082282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4082373Z E ^ 2025-05-07T20:31:45.4082733Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4082738Z 2025-05-07T20:31:45.4083162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4083166Z 2025-05-07T20:31:45.4083271Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4083495Z self=, 2025-05-07T20:31:45.4083583Z T=4096, 2025-05-07T20:31:45.4083659Z D=5120, 2025-05-07T20:31:45.4083742Z scale_ub=1200.0, 2025-05-07T20:31:45.4083835Z contiguous=True, 2025-05-07T20:31:45.4083925Z compiled=False, 2025-05-07T20:31:45.4083998Z ) 2025-05-07T20:31:45.4084225Z self = 2025-05-07T20:31:45.4084477Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.4084482Z 2025-05-07T20:31:45.4084570Z @given( 2025-05-07T20:31:45.4084685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4084788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4084908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4085024Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4085136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4085218Z ) 2025-05-07T20:31:45.4085462Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4085555Z def test_silu_mul_quant( 2025-05-07T20:31:45.4085637Z self, 2025-05-07T20:31:45.4085787Z T: int, 2025-05-07T20:31:45.4085863Z D: int, 2025-05-07T20:31:45.4085966Z scale_ub: Optional[float], 2025-05-07T20:31:45.4086055Z contiguous: bool, 2025-05-07T20:31:45.4086147Z compiled: bool, 2025-05-07T20:31:45.4086230Z ) -> None: 2025-05-07T20:31:45.4086324Z torch.manual_seed(2025) 2025-05-07T20:31:45.4086407Z 2025-05-07T20:31:45.4086572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4086646Z 2025-05-07T20:31:45.4086744Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4086870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4086961Z x = x_sign * x_clamp 2025-05-07T20:31:45.4087051Z x0 = x[:, :D] 2025-05-07T20:31:45.4087131Z x1 = x[:, D:] 2025-05-07T20:31:45.4087202Z 2025-05-07T20:31:45.4087291Z if contiguous: 2025-05-07T20:31:45.4087382Z x0 = x0.contiguous() 2025-05-07T20:31:45.4087477Z x1 = x1.contiguous() 2025-05-07T20:31:45.4087557Z 2025-05-07T20:31:45.4087645Z if scale_ub is not None: 2025-05-07T20:31:45.4087756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4087892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4087971Z ) 2025-05-07T20:31:45.4088053Z else: 2025-05-07T20:31:45.4088146Z scale_ub_tensor = None 2025-05-07T20:31:45.4088217Z 2025-05-07T20:31:45.4088351Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4088441Z op = silu_mul_quant 2025-05-07T20:31:45.4088524Z if compiled: 2025-05-07T20:31:45.4088628Z op = torch.compile(op) 2025-05-07T20:31:45.4088731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4088803Z 2025-05-07T20:31:45.4088897Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4088901Z 2025-05-07T20:31:45.4088997Z moe/activation_test.py:117: 2025-05-07T20:31:45.4089133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4089237Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4089336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4089848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4089944Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4090301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4090528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4090867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4090969Z kernel = self.compile( 2025-05-07T20:31:45.4091350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4091526Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4091662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4091667Z 2025-05-07T20:31:45.4091950Z self = 2025-05-07T20:31:45.4092740Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4093245Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894ef2b60>} 2025-05-07T20:31:45.4094007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4094292Z context = 2025-05-07T20:31:45.4094296Z 2025-05-07T20:31:45.4094459Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4094729Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4094835Z module_map=module_map) 2025-05-07T20:31:45.4094996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4095102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4095199Z E ^ 2025-05-07T20:31:45.4095588Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4095593Z 2025-05-07T20:31:45.4096011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4096021Z 2025-05-07T20:31:45.4096122Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4096352Z self=, 2025-05-07T20:31:45.4096428Z T=1, 2025-05-07T20:31:45.4096508Z D=5120, 2025-05-07T20:31:45.4096597Z scale_ub=None, 2025-05-07T20:31:45.4096686Z contiguous=True, 2025-05-07T20:31:45.4096775Z compiled=True, 2025-05-07T20:31:45.4096847Z ) 2025-05-07T20:31:45.4097064Z self = 2025-05-07T20:31:45.4097229Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4097233Z 2025-05-07T20:31:45.4097311Z @given( 2025-05-07T20:31:45.4097428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4097532Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4097646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4097761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4097884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4097958Z ) 2025-05-07T20:31:45.4098214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4098310Z def test_silu_mul_quant( 2025-05-07T20:31:45.4098389Z self, 2025-05-07T20:31:45.4098472Z T: int, 2025-05-07T20:31:45.4098550Z D: int, 2025-05-07T20:31:45.4098648Z scale_ub: Optional[float], 2025-05-07T20:31:45.4098745Z contiguous: bool, 2025-05-07T20:31:45.4098830Z compiled: bool, 2025-05-07T20:31:45.4098907Z ) -> None: 2025-05-07T20:31:45.4099008Z torch.manual_seed(2025) 2025-05-07T20:31:45.4099080Z 2025-05-07T20:31:45.4099252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4099336Z 2025-05-07T20:31:45.4099428Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4099561Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4099654Z x = x_sign * x_clamp 2025-05-07T20:31:45.4099736Z x0 = x[:, :D] 2025-05-07T20:31:45.4099825Z x1 = x[:, D:] 2025-05-07T20:31:45.4099896Z 2025-05-07T20:31:45.4099980Z if contiguous: 2025-05-07T20:31:45.4100164Z x0 = x0.contiguous() 2025-05-07T20:31:45.4100254Z x1 = x1.contiguous() 2025-05-07T20:31:45.4100325Z 2025-05-07T20:31:45.4100422Z if scale_ub is not None: 2025-05-07T20:31:45.4100528Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4100660Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4100743Z ) 2025-05-07T20:31:45.4100820Z else: 2025-05-07T20:31:45.4100920Z scale_ub_tensor = None 2025-05-07T20:31:45.4100997Z 2025-05-07T20:31:45.4101124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4101219Z op = silu_mul_quant 2025-05-07T20:31:45.4101304Z if compiled: 2025-05-07T20:31:45.4101479Z op = torch.compile(op) 2025-05-07T20:31:45.4101591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4101663Z 2025-05-07T20:31:45.4101752Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4101882Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4101953Z 2025-05-07T20:31:45.4102086Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4102194Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4102292Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4102418Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4102556Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4102627Z 2025-05-07T20:31:45.4102731Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4102736Z 2025-05-07T20:31:45.4102833Z moe/activation_test.py:126: 2025-05-07T20:31:45.4102961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4103078Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4103208Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4103789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4103890Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4104257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4104481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4104850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4105112Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4105512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4105775Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4106154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4106322Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4106670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4106748Z fn() 2025-05-07T20:31:45.4107150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4107238Z self.fn.run( 2025-05-07T20:31:45.4107577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4107675Z kernel = self.compile( 2025-05-07T20:31:45.4108060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4108232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4108448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4108453Z 2025-05-07T20:31:45.4108656Z self = 2025-05-07T20:31:45.4109524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4110032Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68a93ec040>} 2025-05-07T20:31:45.4110779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4111051Z context = 2025-05-07T20:31:45.4111056Z 2025-05-07T20:31:45.4111221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4111489Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4111596Z module_map=module_map) 2025-05-07T20:31:45.4111757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4111866Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4111944Z E ^ 2025-05-07T20:31:45.4112301Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4112312Z 2025-05-07T20:31:45.4112730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4112740Z 2025-05-07T20:31:45.4112843Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4113076Z self=, 2025-05-07T20:31:45.4113153Z T=2048, 2025-05-07T20:31:45.4113229Z D=5120, 2025-05-07T20:31:45.4113317Z scale_ub=None, 2025-05-07T20:31:45.4113401Z contiguous=True, 2025-05-07T20:31:45.4113484Z compiled=True, 2025-05-07T20:31:45.4113560Z ) 2025-05-07T20:31:45.4113780Z self = 2025-05-07T20:31:45.4113955Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4113959Z 2025-05-07T20:31:45.4114035Z @given( 2025-05-07T20:31:45.4114152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4114256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4114370Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4114490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4114608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4114681Z ) 2025-05-07T20:31:45.4114929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4115032Z def test_silu_mul_quant( 2025-05-07T20:31:45.4115107Z self, 2025-05-07T20:31:45.4115189Z T: int, 2025-05-07T20:31:45.4115264Z D: int, 2025-05-07T20:31:45.4115361Z scale_ub: Optional[float], 2025-05-07T20:31:45.4115458Z contiguous: bool, 2025-05-07T20:31:45.4115545Z compiled: bool, 2025-05-07T20:31:45.4115623Z ) -> None: 2025-05-07T20:31:45.4115722Z torch.manual_seed(2025) 2025-05-07T20:31:45.4115796Z 2025-05-07T20:31:45.4115963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4116042Z 2025-05-07T20:31:45.4116133Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4116262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4116356Z x = x_sign * x_clamp 2025-05-07T20:31:45.4116435Z x0 = x[:, :D] 2025-05-07T20:31:45.4116521Z x1 = x[:, D:] 2025-05-07T20:31:45.4116677Z 2025-05-07T20:31:45.4116761Z if contiguous: 2025-05-07T20:31:45.4116858Z x0 = x0.contiguous() 2025-05-07T20:31:45.4116945Z x1 = x1.contiguous() 2025-05-07T20:31:45.4117016Z 2025-05-07T20:31:45.4117110Z if scale_ub is not None: 2025-05-07T20:31:45.4117214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4117348Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4117429Z ) 2025-05-07T20:31:45.4117504Z else: 2025-05-07T20:31:45.4117600Z scale_ub_tensor = None 2025-05-07T20:31:45.4117677Z 2025-05-07T20:31:45.4117804Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4117976Z op = silu_mul_quant 2025-05-07T20:31:45.4118061Z if compiled: 
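To keep this suite green on pre-sm_89 runners, one option is to skip the whole test class up front rather than letting Hypothesis grind through every example. A sketch with unittest, assuming the supports_fp8_e4m3 helper above (the class name here is an assumption; the object reprs in this log were stripped, so the real name is not visible):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        "fp8e4nv (E4M3) needs NVIDIA compute capability >= 8.9",
    )
    class SiluMulQuantTests(unittest.TestCase):
        ...
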
2025-05-07T20:31:45.4118159Z op = torch.compile(op) 2025-05-07T20:31:45.4118267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4118338Z 2025-05-07T20:31:45.4118432Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4118561Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4118632Z 2025-05-07T20:31:45.4118771Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4118878Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4118977Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4119096Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4119242Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4119314Z 2025-05-07T20:31:45.4119419Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4119424Z 2025-05-07T20:31:45.4119530Z moe/activation_test.py:126: 2025-05-07T20:31:45.4119658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4119768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4119904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4120468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4120576Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4120940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4121168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4121535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4121793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4122207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4122463Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4122843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4123010Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4123353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4123436Z fn() 2025-05-07T20:31:45.4123843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4123925Z self.fn.run( 2025-05-07T20:31:45.4124270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4124366Z kernel = self.compile( 2025-05-07T20:31:45.4124755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4125032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4125166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4125170Z 2025-05-07T20:31:45.4125379Z self = 2025-05-07T20:31:45.4126163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:31:45.4126679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68a9986840>} 2025-05-07T20:31:45.4127505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4127699Z context = 2025-05-07T20:31:45.4127709Z 2025-05-07T20:31:45.4127871Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4128131Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4128539Z module_map=module_map) 2025-05-07T20:31:45.4128734Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4128839Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4128927Z E ^ 2025-05-07T20:31:45.4129284Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4129295Z 2025-05-07T20:31:45.4129721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4129726Z 2025-05-07T20:31:45.4129837Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4130060Z self=, 2025-05-07T20:31:45.4130145Z T=128, 2025-05-07T20:31:45.4130223Z D=5120, 2025-05-07T20:31:45.4130306Z scale_ub=None, 2025-05-07T20:31:45.4130398Z contiguous=True, 2025-05-07T20:31:45.4130481Z compiled=True, 2025-05-07T20:31:45.4130553Z ) 2025-05-07T20:31:45.4130776Z self = 2025-05-07T20:31:45.4130941Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4130946Z 2025-05-07T20:31:45.4131032Z @given( 2025-05-07T20:31:45.4131152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4131257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4131378Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4131494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4131612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4131693Z ) 2025-05-07T20:31:45.4131941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4132042Z def test_silu_mul_quant( 2025-05-07T20:31:45.4132119Z self, 2025-05-07T20:31:45.4132196Z T: int, 2025-05-07T20:31:45.4132278Z D: int, 2025-05-07T20:31:45.4132376Z scale_ub: Optional[float], 2025-05-07T20:31:45.4132466Z contiguous: bool, 2025-05-07T20:31:45.4132557Z compiled: bool, 2025-05-07T20:31:45.4132636Z ) -> None: 2025-05-07T20:31:45.4132731Z torch.manual_seed(2025) 2025-05-07T20:31:45.4132810Z 2025-05-07T20:31:45.4132978Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4133056Z 2025-05-07T20:31:45.4133155Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4133278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4133366Z x = x_sign * x_clamp 2025-05-07T20:31:45.4133643Z x0 = x[:, :D] 2025-05-07T20:31:45.4133729Z x1 = x[:, D:] 2025-05-07T20:31:45.4133806Z 2025-05-07T20:31:45.4133889Z if contiguous: 2025-05-07T20:31:45.4133980Z x0 = x0.contiguous() 2025-05-07T20:31:45.4134074Z x1 = x1.contiguous() 2025-05-07T20:31:45.4134145Z 2025-05-07T20:31:45.4134234Z if scale_ub is not None: 2025-05-07T20:31:45.4134346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4134480Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4134555Z ) 2025-05-07T20:31:45.4134636Z else: 2025-05-07T20:31:45.4134729Z scale_ub_tensor = None 2025-05-07T20:31:45.4134918Z 2025-05-07T20:31:45.4135055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:31:45.4135144Z op = silu_mul_quant 2025-05-07T20:31:45.4135235Z if compiled: 2025-05-07T20:31:45.4135335Z op = torch.compile(op) 2025-05-07T20:31:45.4135445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4135522Z 2025-05-07T20:31:45.4135611Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4135732Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4135813Z 2025-05-07T20:31:45.4135948Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4136050Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4136155Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4136277Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4136416Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4136496Z 2025-05-07T20:31:45.4136601Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4136606Z 2025-05-07T20:31:45.4136712Z moe/activation_test.py:126: 2025-05-07T20:31:45.4136843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4136952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4137094Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4137658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4137763Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4138131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4138353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4138727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4138988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4139395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4139657Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4140034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4140209Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4140553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4140634Z fn() 2025-05-07T20:31:45.4141047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4141135Z self.fn.run( 2025-05-07T20:31:45.4141478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4141577Z kernel = self.compile( 2025-05-07T20:31:45.4142042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4142229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4142358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4142363Z 2025-05-07T20:31:45.4142565Z self = 2025-05-07T20:31:45.4143355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4143863Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6894c95c60>} 2025-05-07T20:31:45.4144702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4144892Z context = 2025-05-07T20:31:45.4144897Z 2025-05-07T20:31:45.4145067Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4145330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4145436Z module_map=module_map) 2025-05-07T20:31:45.4145602Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4145703Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4145783Z E ^ 2025-05-07T20:31:45.4146153Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4146158Z 2025-05-07T20:31:45.4146581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4146585Z 2025-05-07T20:31:45.4146698Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4146922Z self=, 2025-05-07T20:31:45.4146998Z T=4096, 2025-05-07T20:31:45.4147084Z D=5120, 2025-05-07T20:31:45.4147169Z scale_ub=None, 2025-05-07T20:31:45.4147257Z contiguous=True, 2025-05-07T20:31:45.4147346Z compiled=True, 2025-05-07T20:31:45.4147420Z ) 2025-05-07T20:31:45.4147637Z self = 2025-05-07T20:31:45.4147813Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4147818Z 2025-05-07T20:31:45.4147901Z @given( 2025-05-07T20:31:45.4148024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4148123Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4148238Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4148365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4148477Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4148556Z ) 2025-05-07T20:31:45.4148806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4148902Z def test_silu_mul_quant( 2025-05-07T20:31:45.4148982Z self, 2025-05-07T20:31:45.4149138Z T: int, 2025-05-07T20:31:45.4149216Z D: int, 2025-05-07T20:31:45.4149321Z scale_ub: Optional[float], 2025-05-07T20:31:45.4149410Z contiguous: bool, 2025-05-07T20:31:45.4149497Z compiled: bool, 2025-05-07T20:31:45.4149580Z ) -> None: 2025-05-07T20:31:45.4149676Z torch.manual_seed(2025) 2025-05-07T20:31:45.4149753Z 2025-05-07T20:31:45.4149928Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4150001Z 2025-05-07T20:31:45.4150094Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4150307Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4150395Z x = x_sign * x_clamp 2025-05-07T20:31:45.4150475Z x0 = x[:, :D] 2025-05-07T20:31:45.4150561Z x1 = x[:, D:] 2025-05-07T20:31:45.4150632Z 2025-05-07T20:31:45.4150722Z if contiguous: 2025-05-07T20:31:45.4150812Z x0 = x0.contiguous() 2025-05-07T20:31:45.4150900Z x1 = x1.contiguous() 2025-05-07T20:31:45.4150978Z 2025-05-07T20:31:45.4151070Z if scale_ub is not None: 2025-05-07T20:31:45.4151175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4151317Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4151392Z ) 2025-05-07T20:31:45.4151614Z else: 2025-05-07T20:31:45.4151715Z scale_ub_tensor 
= None 2025-05-07T20:31:45.4151787Z 2025-05-07T20:31:45.4151918Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4152014Z op = silu_mul_quant 2025-05-07T20:31:45.4152106Z if compiled: 2025-05-07T20:31:45.4152206Z op = torch.compile(op) 2025-05-07T20:31:45.4152320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4152393Z 2025-05-07T20:31:45.4152490Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4152610Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4152684Z 2025-05-07T20:31:45.4152827Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4152929Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4153028Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4153155Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4153305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4153379Z 2025-05-07T20:31:45.4153488Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4153492Z 2025-05-07T20:31:45.4153592Z moe/activation_test.py:126: 2025-05-07T20:31:45.4153731Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4153836Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4153971Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4154545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4154649Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4155020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4155293Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4155667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4155929Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4156336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4156587Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4156970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4157135Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4157489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4157566Z fn() 2025-05-07T20:31:45.4157967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4158062Z self.fn.run( 2025-05-07T20:31:45.4158400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4158601Z kernel = self.compile( 2025-05-07T20:31:45.4158997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4159171Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4159304Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4159308Z 2025-05-07T20:31:45.4159509Z self = 2025-05-07T20:31:45.4160294Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4160888Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68940dfc40>} 2025-05-07T20:31:45.4161648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4161842Z context = 2025-05-07T20:31:45.4161847Z 2025-05-07T20:31:45.4162009Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4162277Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4162383Z module_map=module_map) 2025-05-07T20:31:45.4162544Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4162656Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4162734Z E ^ 2025-05-07T20:31:45.4163091Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4163096Z 2025-05-07T20:31:45.4163522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4163526Z 2025-05-07T20:31:45.4163628Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4163857Z self=, 2025-05-07T20:31:45.4163933Z T=16384, 2025-05-07T20:31:45.4164009Z D=5120, 2025-05-07T20:31:45.4164096Z scale_ub=None, 2025-05-07T20:31:45.4164180Z contiguous=True, 2025-05-07T20:31:45.4164263Z compiled=True, 2025-05-07T20:31:45.4164341Z ) 2025-05-07T20:31:45.4164561Z self = 2025-05-07T20:31:45.4164739Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:45.4164749Z 2025-05-07T20:31:45.4164833Z @given( 2025-05-07T20:31:45.4164950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4165056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4165173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4165288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4165407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4165482Z ) 2025-05-07T20:31:45.4165727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4165827Z def test_silu_mul_quant( 2025-05-07T20:31:45.4165904Z self, 2025-05-07T20:31:45.4165980Z T: int, 2025-05-07T20:31:45.4166065Z D: int, 2025-05-07T20:31:45.4166162Z scale_ub: Optional[float], 2025-05-07T20:31:45.4166255Z contiguous: bool, 2025-05-07T20:31:45.4166344Z compiled: bool, 2025-05-07T20:31:45.4166421Z ) -> None: 2025-05-07T20:31:45.4166522Z torch.manual_seed(2025) 2025-05-07T20:31:45.4166594Z 2025-05-07T20:31:45.4166761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4166921Z 2025-05-07T20:31:45.4167013Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4167136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4167231Z x = x_sign * x_clamp 2025-05-07T20:31:45.4167311Z x0 = x[:, :D] 2025-05-07T20:31:45.4167393Z x1 = x[:, D:] 2025-05-07T20:31:45.4167471Z 2025-05-07T20:31:45.4167556Z if contiguous: 2025-05-07T20:31:45.4167653Z x0 = x0.contiguous() 2025-05-07T20:31:45.4167741Z x1 = x1.contiguous() 2025-05-07T20:31:45.4167813Z 2025-05-07T20:31:45.4167912Z if scale_ub is not None: 2025-05-07T20:31:45.4168017Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4168152Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:31:45.4168422Z ) 2025-05-07T20:31:45.4168536Z else: 2025-05-07T20:31:45.4175655Z scale_ub_tensor = None 2025-05-07T20:31:45.4175744Z 2025-05-07T20:31:45.4175899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4175994Z op = silu_mul_quant 2025-05-07T20:31:45.4176090Z if compiled: 2025-05-07T20:31:45.4176197Z op = torch.compile(op) 2025-05-07T20:31:45.4176305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4176386Z 2025-05-07T20:31:45.4176478Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4176603Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4176685Z 2025-05-07T20:31:45.4176824Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4176927Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4177035Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4177163Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4177313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4177386Z 2025-05-07T20:31:45.4177494Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:45.4177499Z 2025-05-07T20:31:45.4177607Z moe/activation_test.py:126: 2025-05-07T20:31:45.4177738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4177849Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4177990Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4178559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4178672Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4179036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4179265Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4179644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4179902Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4180308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4180558Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4180935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4181110Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4181452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4181535Z fn() 2025-05-07T20:31:45.4181942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4182027Z self.fn.run( 2025-05-07T20:31:45.4182533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4182676Z kernel = self.compile( 2025-05-07T20:31:45.4183231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4183491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4183674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
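For reference, ref_fn above is plain eager PyTorch until its last line: it evaluates x0 * sigmoid(x0) * x1 in fp32 and only dies inside triton_quantize_fp8_row, whose kernel must emit fp8e4nv. The row-wise recipe that kernel implements can be written in pure PyTorch roughly as follows (a sketch of the standard rowwise scheme, with scale_ub applied as a clamp on the per-row max; not FBGEMM's exact kernel):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max |value| maps to the E4M3 max.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantization then matches the test's own check: y_fp8.to(torch.float32) * scale[:, None].
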
2025-05-07T20:31:45.4183681Z 2025-05-07T20:31:45.4183977Z self = 2025-05-07T20:31:45.4185010Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4185704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6895e271a0>} 2025-05-07T20:31:45.4186468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4186658Z context = 2025-05-07T20:31:45.4186663Z 2025-05-07T20:31:45.4186838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4187101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4187209Z module_map=module_map) 2025-05-07T20:31:45.4187385Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4187488Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4187566Z E ^ 2025-05-07T20:31:45.4187939Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4187944Z 2025-05-07T20:31:45.4188362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4188367Z 2025-05-07T20:31:45.4188477Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4188702Z self=, 2025-05-07T20:31:45.4188778Z T=1, 2025-05-07T20:31:45.4188862Z D=5120, 2025-05-07T20:31:45.4188946Z scale_ub=1200.0, 2025-05-07T20:31:45.4189038Z contiguous=True, 2025-05-07T20:31:45.4189246Z compiled=True, 2025-05-07T20:31:45.4189324Z ) 2025-05-07T20:31:45.4189558Z self = 2025-05-07T20:31:45.4189723Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4189728Z 2025-05-07T20:31:45.4189809Z @given( 2025-05-07T20:31:45.4189939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4190039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4190156Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4190279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4190395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4190476Z ) 2025-05-07T20:31:45.4190722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4190819Z def test_silu_mul_quant( 2025-05-07T20:31:45.4190902Z self, 2025-05-07T20:31:45.4190980Z T: int, 2025-05-07T20:31:45.4191058Z D: int, 2025-05-07T20:31:45.4191165Z scale_ub: Optional[float], 2025-05-07T20:31:45.4191257Z contiguous: bool, 2025-05-07T20:31:45.4191343Z compiled: bool, 2025-05-07T20:31:45.4191431Z ) -> None: 2025-05-07T20:31:45.4191526Z torch.manual_seed(2025) 2025-05-07T20:31:45.4191598Z 2025-05-07T20:31:45.4191865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4191939Z 2025-05-07T20:31:45.4192040Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4192169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4192259Z x = x_sign * x_clamp 2025-05-07T20:31:45.4192350Z x0 = x[:, :D] 2025-05-07T20:31:45.4192429Z x1 = x[:, D:] 2025-05-07T20:31:45.4192502Z 2025-05-07T20:31:45.4192596Z if contiguous: 2025-05-07T20:31:45.4192690Z x0 = x0.contiguous() 2025-05-07T20:31:45.4192779Z x1 = x1.contiguous() 2025-05-07T20:31:45.4192858Z 2025-05-07T20:31:45.4192949Z if scale_ub is not None: 2025-05-07T20:31:45.4193134Z scale_ub_tensor = 
torch.tensor( 2025-05-07T20:31:45.4193279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4193355Z ) 2025-05-07T20:31:45.4193434Z else: 2025-05-07T20:31:45.4193545Z scale_ub_tensor = None 2025-05-07T20:31:45.4193621Z 2025-05-07T20:31:45.4193762Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4193853Z op = silu_mul_quant 2025-05-07T20:31:45.4193939Z if compiled: 2025-05-07T20:31:45.4194050Z op = torch.compile(op) 2025-05-07T20:31:45.4194157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4194230Z 2025-05-07T20:31:45.4194329Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4194333Z 2025-05-07T20:31:45.4194432Z moe/activation_test.py:117: 2025-05-07T20:31:45.4194563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4194673Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4194781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4195161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4195262Z return fn(*args, **kwargs) 2025-05-07T20:31:45.4195760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4195867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4196226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4196450Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4196798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4196892Z kernel = self.compile( 2025-05-07T20:31:45.4197281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4197460Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4197594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4197599Z 2025-05-07T20:31:45.4197812Z self = 2025-05-07T20:31:45.4198602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4199119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6895f7b420>} 2025-05-07T20:31:45.4199878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4200081Z context = 2025-05-07T20:31:45.4200085Z 2025-05-07T20:31:45.4200334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4200600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4200717Z module_map=module_map) 2025-05-07T20:31:45.4200881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4200981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4201071Z E ^ 2025-05-07T20:31:45.4201433Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4201438Z 2025-05-07T20:31:45.4201866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4201950Z 2025-05-07T20:31:45.4202056Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4202283Z self=, 2025-05-07T20:31:45.4202373Z T=1, 2025-05-07T20:31:45.4202449Z D=5120, 2025-05-07T20:31:45.4202534Z scale_ub=None, 2025-05-07T20:31:45.4202628Z contiguous=False, 2025-05-07T20:31:45.4202712Z compiled=True, 2025-05-07T20:31:45.4202784Z ) 2025-05-07T20:31:45.4203019Z self = 2025-05-07T20:31:45.4203183Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4203187Z 2025-05-07T20:31:45.4203276Z @given( 2025-05-07T20:31:45.4203394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4203494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4203619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4203742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4203855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4203938Z ) 2025-05-07T20:31:45.4204190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4204291Z def test_silu_mul_quant( 2025-05-07T20:31:45.4204371Z self, 2025-05-07T20:31:45.4204449Z T: int, 2025-05-07T20:31:45.4204533Z D: int, 2025-05-07T20:31:45.4204631Z scale_ub: Optional[float], 2025-05-07T20:31:45.4204723Z contiguous: bool, 2025-05-07T20:31:45.4204816Z compiled: bool, 2025-05-07T20:31:45.4204894Z ) -> None: 2025-05-07T20:31:45.4204991Z torch.manual_seed(2025) 2025-05-07T20:31:45.4205071Z 2025-05-07T20:31:45.4205242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4205314Z 2025-05-07T20:31:45.4205413Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4205542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4205639Z x = x_sign * x_clamp 2025-05-07T20:31:45.4205720Z x0 = x[:, :D] 2025-05-07T20:31:45.4205801Z x1 = x[:, D:] 2025-05-07T20:31:45.4205882Z 2025-05-07T20:31:45.4205970Z if contiguous: 2025-05-07T20:31:45.4206062Z x0 = x0.contiguous() 2025-05-07T20:31:45.4206160Z x1 = x1.contiguous() 2025-05-07T20:31:45.4206233Z 2025-05-07T20:31:45.4206323Z if scale_ub is not None: 2025-05-07T20:31:45.4206436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4206573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4206649Z ) 2025-05-07T20:31:45.4206733Z else: 2025-05-07T20:31:45.4206827Z scale_ub_tensor = None 2025-05-07T20:31:45.4206900Z 2025-05-07T20:31:45.4207036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4207126Z op = silu_mul_quant 2025-05-07T20:31:45.4207223Z if compiled: 2025-05-07T20:31:45.4207323Z op = torch.compile(op) 2025-05-07T20:31:45.4207430Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4207509Z 2025-05-07T20:31:45.4207688Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4207811Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4207892Z 2025-05-07T20:31:45.4208028Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4208133Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4208242Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4208369Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4208518Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4208592Z 2025-05-07T20:31:45.4208695Z > y_fp8_ref, 
Hypothesis then tries the remaining examples. Each one reprints the identical test source and an identical traceback, so only the drawn parameters and the kernel that fails to compile are listed here; every example ends in the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=7168, scale_ub=None,   contiguous=False, compiled=True  -> _kernel_quantize_fp8_row (fn() returns; ref_fn() fails, as above)
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True  -> _fbgemm_silu_mul_quant
Trying example: T=1,   D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> _fbgemm_silu_mul_quant
at 0x7f6873915b20>} 2025-05-07T20:31:45.4216791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4217084Z context = 2025-05-07T20:31:45.4217089Z 2025-05-07T20:31:45.4217264Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4217529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4217637Z module_map=module_map) 2025-05-07T20:31:45.4217805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4217911Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4217990Z E ^ 2025-05-07T20:31:45.4218356Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4218436Z 2025-05-07T20:31:45.4218857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4218861Z 2025-05-07T20:31:45.4218974Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4219209Z self=, 2025-05-07T20:31:45.4219297Z T=1, 2025-05-07T20:31:45.4219374Z D=5120, 2025-05-07T20:31:45.4219456Z scale_ub=None, 2025-05-07T20:31:45.4219548Z contiguous=True, 2025-05-07T20:31:45.4219632Z compiled=False, 2025-05-07T20:31:45.4219705Z ) 2025-05-07T20:31:45.4219933Z self = 2025-05-07T20:31:45.4220097Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.4220101Z 2025-05-07T20:31:45.4220186Z @given( 2025-05-07T20:31:45.4220305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4220408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4220529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4220645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4220759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4220844Z ) 2025-05-07T20:31:45.4221091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4221185Z def test_silu_mul_quant( 2025-05-07T20:31:45.4221270Z self, 2025-05-07T20:31:45.4221346Z T: int, 2025-05-07T20:31:45.4221423Z D: int, 2025-05-07T20:31:45.4221531Z scale_ub: Optional[float], 2025-05-07T20:31:45.4221621Z contiguous: bool, 2025-05-07T20:31:45.4221713Z compiled: bool, 2025-05-07T20:31:45.4221792Z ) -> None: 2025-05-07T20:31:45.4221887Z torch.manual_seed(2025) 2025-05-07T20:31:45.4221965Z 2025-05-07T20:31:45.4222134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4222212Z 2025-05-07T20:31:45.4222310Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4222436Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4222530Z x = x_sign * x_clamp 2025-05-07T20:31:45.4222620Z x0 = x[:, :D] 2025-05-07T20:31:45.4222699Z x1 = x[:, D:] 2025-05-07T20:31:45.4222770Z 2025-05-07T20:31:45.4222859Z if contiguous: 2025-05-07T20:31:45.4222950Z x0 = x0.contiguous() 2025-05-07T20:31:45.4223040Z x1 = x1.contiguous() 2025-05-07T20:31:45.4223117Z 2025-05-07T20:31:45.4223205Z if scale_ub is not None: 2025-05-07T20:31:45.4223315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4223449Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4223523Z ) 2025-05-07T20:31:45.4223605Z else: 2025-05-07T20:31:45.4223698Z scale_ub_tensor = None 2025-05-07T20:31:45.4223770Z 2025-05-07T20:31:45.4223910Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4224001Z op = silu_mul_quant 2025-05-07T20:31:45.4224086Z if compiled: 2025-05-07T20:31:45.4224195Z 
op = torch.compile(op) 2025-05-07T20:31:45.4224386Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4224459Z 2025-05-07T20:31:45.4224556Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4224560Z 2025-05-07T20:31:45.4224657Z moe/activation_test.py:117: 2025-05-07T20:31:45.4224796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4224897Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4224996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4225530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4225638Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4226010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4226312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4226659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4226759Z kernel = self.compile( 2025-05-07T20:31:45.4227141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4227312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4227445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4227450Z 2025-05-07T20:31:45.4227651Z self = 2025-05-07T20:31:45.4228746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4229311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873372200>} 2025-05-07T20:31:45.4230070Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4230264Z context = 2025-05-07T20:31:45.4230268Z 2025-05-07T20:31:45.4230431Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4230699Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4230805Z module_map=module_map) 2025-05-07T20:31:45.4230971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4231078Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4231156Z E ^ 2025-05-07T20:31:45.4231523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4231528Z 2025-05-07T20:31:45.4231945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4231949Z 2025-05-07T20:31:45.4232053Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4232283Z self=, 2025-05-07T20:31:45.4232360Z T=128, 2025-05-07T20:31:45.4232437Z D=5120, 2025-05-07T20:31:45.4232523Z scale_ub=None, 2025-05-07T20:31:45.4232609Z contiguous=False, 2025-05-07T20:31:45.4232696Z compiled=True, 2025-05-07T20:31:45.4232768Z ) 2025-05-07T20:31:45.4232986Z self = 2025-05-07T20:31:45.4233168Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4233173Z 2025-05-07T20:31:45.4233252Z @given( 2025-05-07T20:31:45.4233591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4233700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4233815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4233930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4234050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4234124Z ) 2025-05-07T20:31:45.4234376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4234470Z def test_silu_mul_quant( 2025-05-07T20:31:45.4234547Z self, 2025-05-07T20:31:45.4234630Z T: int, 2025-05-07T20:31:45.4234707Z D: int, 2025-05-07T20:31:45.4234809Z scale_ub: Optional[float], 2025-05-07T20:31:45.4235025Z contiguous: bool, 2025-05-07T20:31:45.4235112Z compiled: bool, 2025-05-07T20:31:45.4235189Z ) -> None: 2025-05-07T20:31:45.4235290Z torch.manual_seed(2025) 2025-05-07T20:31:45.4235363Z 2025-05-07T20:31:45.4235537Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4235618Z 2025-05-07T20:31:45.4235710Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4235842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4235929Z x = x_sign * x_clamp 2025-05-07T20:31:45.4236013Z x0 = x[:, :D] 2025-05-07T20:31:45.4236102Z x1 = x[:, D:] 2025-05-07T20:31:45.4236173Z 2025-05-07T20:31:45.4236258Z if contiguous: 2025-05-07T20:31:45.4236356Z x0 = x0.contiguous() 2025-05-07T20:31:45.4236447Z x1 = x1.contiguous() 2025-05-07T20:31:45.4236520Z 2025-05-07T20:31:45.4236620Z if scale_ub is not None: 2025-05-07T20:31:45.4236731Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4236867Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4236950Z ) 2025-05-07T20:31:45.4237028Z else: 2025-05-07T20:31:45.4237132Z scale_ub_tensor = None 2025-05-07T20:31:45.4237207Z 2025-05-07T20:31:45.4237334Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4237429Z op = silu_mul_quant 2025-05-07T20:31:45.4237513Z if compiled: 2025-05-07T20:31:45.4237614Z op = torch.compile(op) 2025-05-07T20:31:45.4237723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4237795Z 2025-05-07T20:31:45.4237885Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4237890Z 2025-05-07T20:31:45.4237992Z moe/activation_test.py:117: 2025-05-07T20:31:45.4238121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4238220Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4238338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4238715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4238808Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4239309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4239412Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4239768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4239989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4240334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4240428Z kernel = self.compile( 2025-05-07T20:31:45.4240821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4240998Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4241125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4241212Z 2025-05-07T20:31:45.4241425Z self = 2025-05-07T20:31:45.4242207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4242717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738ee0c0>} 2025-05-07T20:31:45.4243473Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4243801Z context = 2025-05-07T20:31:45.4243812Z 2025-05-07T20:31:45.4243980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4244242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4244355Z module_map=module_map) 2025-05-07T20:31:45.4244515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4244613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4244697Z E ^ 2025-05-07T20:31:45.4245055Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4245060Z 2025-05-07T20:31:45.4245486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4245497Z 2025-05-07T20:31:45.4245601Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4245824Z self=, 2025-05-07T20:31:45.4245908Z T=128, 2025-05-07T20:31:45.4245993Z D=7168, 2025-05-07T20:31:45.4246079Z scale_ub=1200.0, 2025-05-07T20:31:45.4246173Z contiguous=False, 2025-05-07T20:31:45.4246259Z compiled=False, 2025-05-07T20:31:45.4246332Z ) 2025-05-07T20:31:45.4246558Z self = 2025-05-07T20:31:45.4246729Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4246734Z 2025-05-07T20:31:45.4246817Z @given( 2025-05-07T20:31:45.4246934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4247032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4247149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4247269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4247381Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4247459Z ) 2025-05-07T20:31:45.4247707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4247805Z def test_silu_mul_quant( 2025-05-07T20:31:45.4247879Z self, 2025-05-07T20:31:45.4247955Z T: int, 2025-05-07T20:31:45.4248036Z D: int, 2025-05-07T20:31:45.4248133Z scale_ub: Optional[float], 2025-05-07T20:31:45.4248221Z contiguous: bool, 2025-05-07T20:31:45.4248310Z compiled: bool, 2025-05-07T20:31:45.4248386Z ) -> None: 2025-05-07T20:31:45.4248481Z torch.manual_seed(2025) 2025-05-07T20:31:45.4248559Z 2025-05-07T20:31:45.4248724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4248797Z 2025-05-07T20:31:45.4248898Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4249026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4249114Z x = x_sign * x_clamp 2025-05-07T20:31:45.4249201Z x0 = x[:, :D] 2025-05-07T20:31:45.4249279Z x1 = x[:, D:] 2025-05-07T20:31:45.4249357Z 2025-05-07T20:31:45.4249546Z if contiguous: 2025-05-07T20:31:45.4249639Z x0 = x0.contiguous() 2025-05-07T20:31:45.4249733Z x1 = x1.contiguous() 2025-05-07T20:31:45.4249805Z 2025-05-07T20:31:45.4249895Z if scale_ub is not None: 2025-05-07T20:31:45.4250007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4250140Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4250215Z ) 2025-05-07T20:31:45.4250297Z else: 2025-05-07T20:31:45.4250389Z scale_ub_tensor = None 2025-05-07T20:31:45.4250460Z 2025-05-07T20:31:45.4250594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4250683Z op = silu_mul_quant 2025-05-07T20:31:45.4250859Z if compiled: 2025-05-07T20:31:45.4250960Z op = torch.compile(op) 2025-05-07T20:31:45.4251066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4251143Z 2025-05-07T20:31:45.4251234Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4251245Z 2025-05-07T20:31:45.4251344Z moe/activation_test.py:117: 2025-05-07T20:31:45.4251478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4251579Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4251678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4252184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4252281Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4252644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4252866Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4253209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4253306Z kernel = self.compile( 2025-05-07T20:31:45.4253691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4253864Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4253996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4254001Z 2025-05-07T20:31:45.4254201Z self = 2025-05-07T20:31:45.4254991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4255500Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873917ba0>} 2025-05-07T20:31:45.4256267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4256455Z context = 2025-05-07T20:31:45.4256460Z 2025-05-07T20:31:45.4256623Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4256892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4256999Z module_map=module_map) 2025-05-07T20:31:45.4257166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4257265Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4257345Z E ^ 2025-05-07T20:31:45.4257708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4257714Z 2025-05-07T20:31:45.4258214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4258219Z 2025-05-07T20:31:45.4258329Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4258554Z self=, 2025-05-07T20:31:45.4258631Z T=128, 2025-05-07T20:31:45.4258719Z D=5120, 2025-05-07T20:31:45.4258801Z scale_ub=None, 2025-05-07T20:31:45.4258888Z contiguous=False, 2025-05-07T20:31:45.4258978Z compiled=False, 2025-05-07T20:31:45.4259049Z ) 2025-05-07T20:31:45.4259268Z self = 2025-05-07T20:31:45.4259447Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4259527Z 2025-05-07T20:31:45.4259604Z @given( 2025-05-07T20:31:45.4259723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4259829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4259948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4260071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4260184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4260258Z ) 2025-05-07T20:31:45.4260514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4260611Z def test_silu_mul_quant( 2025-05-07T20:31:45.4260686Z self, 2025-05-07T20:31:45.4260770Z T: int, 2025-05-07T20:31:45.4260847Z D: int, 2025-05-07T20:31:45.4260945Z scale_ub: Optional[float], 2025-05-07T20:31:45.4261041Z contiguous: bool, 2025-05-07T20:31:45.4261126Z compiled: bool, 2025-05-07T20:31:45.4261211Z ) -> None: 2025-05-07T20:31:45.4261312Z torch.manual_seed(2025) 2025-05-07T20:31:45.4261384Z 2025-05-07T20:31:45.4261556Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4261629Z 2025-05-07T20:31:45.4261724Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4261853Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4261943Z x = x_sign * x_clamp 2025-05-07T20:31:45.4262026Z x0 = x[:, :D] 2025-05-07T20:31:45.4262111Z x1 = x[:, D:] 2025-05-07T20:31:45.4262182Z 2025-05-07T20:31:45.4262265Z if contiguous: 2025-05-07T20:31:45.4262362Z x0 = x0.contiguous() 2025-05-07T20:31:45.4262450Z x1 = x1.contiguous() 2025-05-07T20:31:45.4262522Z 2025-05-07T20:31:45.4262618Z if scale_ub is not None: 2025-05-07T20:31:45.4262721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4262862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4262941Z ) 2025-05-07T20:31:45.4263015Z else: 2025-05-07T20:31:45.4263114Z scale_ub_tensor = None 2025-05-07T20:31:45.4263187Z 2025-05-07T20:31:45.4263316Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4263417Z op = silu_mul_quant 2025-05-07T20:31:45.4263502Z if compiled: 2025-05-07T20:31:45.4263601Z op = torch.compile(op) 2025-05-07T20:31:45.4263712Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4263785Z 2025-05-07T20:31:45.4263875Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4263886Z 2025-05-07T20:31:45.4263983Z moe/activation_test.py:117: 2025-05-07T20:31:45.4264113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4264219Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4264318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4264818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4264925Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4265366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4265590Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4265934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4266027Z kernel = self.compile( 2025-05-07T20:31:45.4266415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4266587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4266718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4266722Z 2025-05-07T20:31:45.4267005Z self = 2025-05-07T20:31:45.4267797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4268311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6873914fe0>} 2025-05-07T20:31:45.4269139Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4269334Z context = 2025-05-07T20:31:45.4269339Z 2025-05-07T20:31:45.4269501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4269772Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4269887Z module_map=module_map) 2025-05-07T20:31:45.4270051Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4270150Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4270234Z E ^ 2025-05-07T20:31:45.4270590Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4270595Z 2025-05-07T20:31:45.4271019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4271023Z 2025-05-07T20:31:45.4271126Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4271349Z self=, 2025-05-07T20:31:45.4271434Z T=128, 2025-05-07T20:31:45.4271511Z D=5120, 2025-05-07T20:31:45.4271600Z scale_ub=1200.0, 2025-05-07T20:31:45.4271691Z contiguous=True, 2025-05-07T20:31:45.4271776Z compiled=False, 2025-05-07T20:31:45.4271853Z ) 2025-05-07T20:31:45.4272074Z self = 2025-05-07T20:31:45.4272253Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:45.4272258Z 2025-05-07T20:31:45.4272342Z @given( 2025-05-07T20:31:45.4272460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4272562Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4272685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4272802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4272917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4272997Z ) 2025-05-07T20:31:45.4273243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4273344Z def test_silu_mul_quant( 2025-05-07T20:31:45.4273425Z self, 2025-05-07T20:31:45.4273502Z T: int, 2025-05-07T20:31:45.4273589Z D: int, 2025-05-07T20:31:45.4273687Z scale_ub: Optional[float], 2025-05-07T20:31:45.4273776Z contiguous: bool, 2025-05-07T20:31:45.4273950Z compiled: bool, 2025-05-07T20:31:45.4274030Z ) -> None: 2025-05-07T20:31:45.4274126Z torch.manual_seed(2025) 2025-05-07T20:31:45.4274209Z 2025-05-07T20:31:45.4274376Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4274449Z 2025-05-07T20:31:45.4274548Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4274671Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4274767Z x = x_sign * x_clamp 2025-05-07T20:31:45.4274847Z x0 = x[:, :D] 2025-05-07T20:31:45.4274927Z x1 = x[:, D:] 2025-05-07T20:31:45.4275004Z 2025-05-07T20:31:45.4275087Z if contiguous: 2025-05-07T20:31:45.4275277Z x0 = x0.contiguous() 2025-05-07T20:31:45.4275373Z x1 = x1.contiguous() 2025-05-07T20:31:45.4275444Z 2025-05-07T20:31:45.4275534Z if scale_ub is not None: 2025-05-07T20:31:45.4275645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4275784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4275859Z ) 2025-05-07T20:31:45.4275939Z else: 2025-05-07T20:31:45.4276031Z scale_ub_tensor = None 2025-05-07T20:31:45.4276106Z 2025-05-07T20:31:45.4276242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4276331Z op = silu_mul_quant 2025-05-07T20:31:45.4276421Z if compiled: 2025-05-07T20:31:45.4276521Z op = torch.compile(op) 2025-05-07T20:31:45.4276625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4276703Z 2025-05-07T20:31:45.4276793Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4276798Z 2025-05-07T20:31:45.4276904Z moe/activation_test.py:117: 2025-05-07T20:31:45.4277039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4277140Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4277238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4277748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4277845Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4278211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4278431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4278771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4278869Z kernel = self.compile( 2025-05-07T20:31:45.4279251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4279433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4279565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4279569Z 2025-05-07T20:31:45.4279771Z self = 2025-05-07T20:31:45.4280564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4281068Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68738837e0>} 2025-05-07T20:31:45.4281828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4282019Z context = 2025-05-07T20:31:45.4282024Z 2025-05-07T20:31:45.4282328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4282599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4282705Z module_map=module_map) 2025-05-07T20:31:45.4282870Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4282969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4283047Z E ^ 2025-05-07T20:31:45.4283408Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4283413Z 2025-05-07T20:31:45.4283831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4283915Z 2025-05-07T20:31:45.4284025Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4284249Z self=, 2025-05-07T20:31:45.4284335Z T=1, 2025-05-07T20:31:45.4284420Z D=7168, 2025-05-07T20:31:45.4284503Z scale_ub=1200.0, 2025-05-07T20:31:45.4284588Z contiguous=True, 2025-05-07T20:31:45.4284676Z compiled=True, 2025-05-07T20:31:45.4284750Z ) 2025-05-07T20:31:45.4284971Z self = 2025-05-07T20:31:45.4285142Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4285146Z 2025-05-07T20:31:45.4285223Z @given( 2025-05-07T20:31:45.4285348Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4285447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4285561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4285689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4285802Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4285876Z ) 2025-05-07T20:31:45.4286133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4286227Z def test_silu_mul_quant( 2025-05-07T20:31:45.4286302Z self, 2025-05-07T20:31:45.4286385Z T: int, 2025-05-07T20:31:45.4286464Z D: int, 2025-05-07T20:31:45.4286564Z scale_ub: Optional[float], 2025-05-07T20:31:45.4286660Z contiguous: bool, 2025-05-07T20:31:45.4286747Z compiled: bool, 2025-05-07T20:31:45.4286834Z ) -> None: 2025-05-07T20:31:45.4286929Z torch.manual_seed(2025) 2025-05-07T20:31:45.4287001Z 2025-05-07T20:31:45.4287174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4287247Z 2025-05-07T20:31:45.4287340Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4287476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4287565Z x = x_sign * x_clamp 2025-05-07T20:31:45.4287647Z x0 = x[:, :D] 2025-05-07T20:31:45.4287735Z x1 = x[:, D:] 2025-05-07T20:31:45.4287808Z 2025-05-07T20:31:45.4287897Z if contiguous: 2025-05-07T20:31:45.4287995Z x0 = x0.contiguous() 2025-05-07T20:31:45.4288083Z x1 = x1.contiguous() 2025-05-07T20:31:45.4288164Z 2025-05-07T20:31:45.4288255Z if scale_ub is not None: 2025-05-07T20:31:45.4288360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4288501Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4288576Z ) 2025-05-07T20:31:45.4288652Z else: 2025-05-07T20:31:45.4288750Z scale_ub_tensor = None 2025-05-07T20:31:45.4288825Z 2025-05-07T20:31:45.4288953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4289050Z op = silu_mul_quant 2025-05-07T20:31:45.4289139Z if compiled: 2025-05-07T20:31:45.4289239Z op = torch.compile(op) 2025-05-07T20:31:45.4289351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4289423Z 2025-05-07T20:31:45.4289603Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4289608Z 2025-05-07T20:31:45.4289706Z moe/activation_test.py:117: 2025-05-07T20:31:45.4289836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4289942Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4290040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4290408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4290506Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4291003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4291177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4291533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4291759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4292107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4292199Z kernel = self.compile( 2025-05-07T20:31:45.4292580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4292756Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4292884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4292888Z 2025-05-07T20:31:45.4293098Z self = 2025-05-07T20:31:45.4293882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4294397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8e840>} 2025-05-07T20:31:45.4295156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4295344Z context = 2025-05-07T20:31:45.4295349Z 2025-05-07T20:31:45.4295518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4295779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4295900Z module_map=module_map) 2025-05-07T20:31:45.4296060Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4296159Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4296239Z E ^ 2025-05-07T20:31:45.4296602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4296606Z 2025-05-07T20:31:45.4297024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4297028Z 2025-05-07T20:31:45.4297138Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4297364Z self=, 2025-05-07T20:31:45.4297446Z T=1, 2025-05-07T20:31:45.4297521Z D=7168, 2025-05-07T20:31:45.4297604Z scale_ub=1200.0, 2025-05-07T20:31:45.4297695Z contiguous=False, 2025-05-07T20:31:45.4297778Z compiled=True, 2025-05-07T20:31:45.4297859Z ) 2025-05-07T20:31:45.4298081Z self = 2025-05-07T20:31:45.4298247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4298251Z 2025-05-07T20:31:45.4298414Z @given( 2025-05-07T20:31:45.4298541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4298639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4298759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4298875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4298987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4299067Z ) 2025-05-07T20:31:45.4299312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4299405Z def test_silu_mul_quant( 2025-05-07T20:31:45.4299487Z self, 2025-05-07T20:31:45.4299564Z T: int, 2025-05-07T20:31:45.4299714Z D: int, 2025-05-07T20:31:45.4299820Z scale_ub: Optional[float], 2025-05-07T20:31:45.4299909Z contiguous: bool, 2025-05-07T20:31:45.4299993Z compiled: bool, 2025-05-07T20:31:45.4300080Z ) -> None: 2025-05-07T20:31:45.4304919Z torch.manual_seed(2025) 2025-05-07T20:31:45.4305010Z 2025-05-07T20:31:45.4305192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4305273Z 2025-05-07T20:31:45.4305369Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4305497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4305595Z x = x_sign * x_clamp 2025-05-07T20:31:45.4305678Z x0 = x[:, :D] 2025-05-07T20:31:45.4305767Z x1 = x[:, D:] 2025-05-07T20:31:45.4305841Z 2025-05-07T20:31:45.4305929Z if contiguous: 2025-05-07T20:31:45.4306032Z x0 = x0.contiguous() 2025-05-07T20:31:45.4306122Z x1 = x1.contiguous() 2025-05-07T20:31:45.4306195Z 2025-05-07T20:31:45.4306299Z if scale_ub is not None: 2025-05-07T20:31:45.4306407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4306545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4306630Z ) 2025-05-07T20:31:45.4306711Z else: 2025-05-07T20:31:45.4306807Z scale_ub_tensor = None 2025-05-07T20:31:45.4306889Z 2025-05-07T20:31:45.4307026Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4307121Z op = silu_mul_quant 2025-05-07T20:31:45.4307216Z if compiled: 2025-05-07T20:31:45.4307318Z op = torch.compile(op) 2025-05-07T20:31:45.4307434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4307506Z 2025-05-07T20:31:45.4307599Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4307603Z 2025-05-07T20:31:45.4307712Z moe/activation_test.py:117: 2025-05-07T20:31:45.4307844Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4307953Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4308062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4308436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4308542Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4309043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4309230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4309596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4309820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4310160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4310264Z kernel = self.compile( 2025-05-07T20:31:45.4310654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4310839Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4311082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4311087Z 2025-05-07T20:31:45.4311292Z self = 2025-05-07T20:31:45.4312083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4312590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6872f8c900>} 2025-05-07T20:31:45.4313351Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4313640Z context = 2025-05-07T20:31:45.4313649Z 2025-05-07T20:31:45.4313819Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4314082Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4314190Z module_map=module_map) 2025-05-07T20:31:45.4314357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4314460Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4314537Z E ^ 2025-05-07T20:31:45.4314903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4314908Z 2025-05-07T20:31:45.4315352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4315365Z 2025-05-07T20:31:45.4315496Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4315726Z self=, 2025-05-07T20:31:45.4315804Z T=1, 2025-05-07T20:31:45.4315888Z D=7168, 2025-05-07T20:31:45.4315971Z scale_ub=None, 2025-05-07T20:31:45.4316060Z contiguous=False, 2025-05-07T20:31:45.4316151Z compiled=True, 2025-05-07T20:31:45.4316226Z ) 2025-05-07T20:31:45.4316445Z self = 2025-05-07T20:31:45.4316615Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4316620Z 2025-05-07T20:31:45.4316698Z @given( 2025-05-07T20:31:45.4316825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4316924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4317045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4317169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4317282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4317358Z ) 2025-05-07T20:31:45.4317616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4317710Z def test_silu_mul_quant( 2025-05-07T20:31:45.4317788Z self, 2025-05-07T20:31:45.4317872Z T: int, 2025-05-07T20:31:45.4317949Z D: int, 2025-05-07T20:31:45.4318057Z scale_ub: Optional[float], 2025-05-07T20:31:45.4318148Z contiguous: bool, 2025-05-07T20:31:45.4318234Z compiled: bool, 2025-05-07T20:31:45.4318318Z ) -> None: 2025-05-07T20:31:45.4318413Z torch.manual_seed(2025) 2025-05-07T20:31:45.4318488Z 2025-05-07T20:31:45.4318665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4318739Z 2025-05-07T20:31:45.4318836Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4318966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4319057Z x = x_sign * x_clamp 2025-05-07T20:31:45.4319138Z x0 = x[:, :D] 2025-05-07T20:31:45.4319228Z x1 = x[:, D:] 2025-05-07T20:31:45.4319382Z 2025-05-07T20:31:45.4319482Z if contiguous: 2025-05-07T20:31:45.4319574Z x0 = x0.contiguous() 2025-05-07T20:31:45.4319664Z x1 = x1.contiguous() 2025-05-07T20:31:45.4319746Z 2025-05-07T20:31:45.4319840Z if scale_ub is not None: 2025-05-07T20:31:45.4319948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4320096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4320172Z ) 2025-05-07T20:31:45.4320249Z else: 2025-05-07T20:31:45.4320353Z scale_ub_tensor = None 2025-05-07T20:31:45.4320428Z 2025-05-07T20:31:45.4320558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4320733Z op = silu_mul_quant 2025-05-07T20:31:45.4320823Z if compiled: 2025-05-07T20:31:45.4320923Z op = torch.compile(op) 2025-05-07T20:31:45.4321037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4321115Z 2025-05-07T20:31:45.4321213Z y_fp8, y_scale = fn() 2025-05-07T20:31:45.4321336Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:45.4321411Z 2025-05-07T20:31:45.4321555Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4321661Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:45.4321762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:45.4321899Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:45.4322038Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4322112Z 2025-05-07T20:31:45.4322221Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:45.4322225Z 2025-05-07T20:31:45.4322335Z moe/activation_test.py:126: 2025-05-07T20:31:45.4322472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4322579Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:45.4322718Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:45.4323291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:45.4323396Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:45.4323759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4323990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4324360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:45.4324625Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4325035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:45.4325295Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:45.4325686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:45.4325854Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:45.4326205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:45.4326284Z fn() 2025-05-07T20:31:45.4326687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:45.4326783Z self.fn.run( 2025-05-07T20:31:45.4327123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4327222Z kernel = self.compile( 2025-05-07T20:31:45.4327613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4327869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4328007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4328011Z 2025-05-07T20:31:45.4328509Z self = 2025-05-07T20:31:45.4329371Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4329885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f68940f1080>} 2025-05-07T20:31:45.4330857Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4331057Z context = 2025-05-07T20:31:45.4331062Z 2025-05-07T20:31:45.4331228Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4331499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4331608Z module_map=module_map) 2025-05-07T20:31:45.4331772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4331884Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:45.4331962Z E ^ 2025-05-07T20:31:45.4332321Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4332333Z 2025-05-07T20:31:45.4332761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4332765Z 2025-05-07T20:31:45.4332874Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4333108Z self=, 2025-05-07T20:31:45.4333186Z T=1, 2025-05-07T20:31:45.4333267Z D=5120, 2025-05-07T20:31:45.4333357Z scale_ub=1200.0, 2025-05-07T20:31:45.4333446Z contiguous=False, 2025-05-07T20:31:45.4333531Z compiled=True, 2025-05-07T20:31:45.4333614Z ) 2025-05-07T20:31:45.4333832Z self = 2025-05-07T20:31:45.4333998Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4334010Z 2025-05-07T20:31:45.4334087Z @given( 2025-05-07T20:31:45.4334204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4334319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4334434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4334551Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4334674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4334749Z ) 2025-05-07T20:31:45.4335003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4335107Z def test_silu_mul_quant( 2025-05-07T20:31:45.4335202Z self, 2025-05-07T20:31:45.4335285Z T: int, 2025-05-07T20:31:45.4335389Z D: int, 2025-05-07T20:31:45.4335493Z scale_ub: Optional[float], 2025-05-07T20:31:45.4335588Z contiguous: bool, 2025-05-07T20:31:45.4335674Z compiled: bool, 2025-05-07T20:31:45.4335754Z ) -> None: 2025-05-07T20:31:45.4335854Z torch.manual_seed(2025) 2025-05-07T20:31:45.4335928Z 2025-05-07T20:31:45.4336094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4336180Z 2025-05-07T20:31:45.4336273Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4336397Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4336627Z x = x_sign * x_clamp 2025-05-07T20:31:45.4336710Z x0 = x[:, :D] 2025-05-07T20:31:45.4336791Z x1 = x[:, D:] 2025-05-07T20:31:45.4336872Z 2025-05-07T20:31:45.4336957Z if contiguous: 2025-05-07T20:31:45.4337050Z x0 = x0.contiguous() 2025-05-07T20:31:45.4337151Z x1 = x1.contiguous() 2025-05-07T20:31:45.4337225Z 2025-05-07T20:31:45.4337328Z if scale_ub is not None: 2025-05-07T20:31:45.4337434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4337568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4337655Z ) 2025-05-07T20:31:45.4337732Z else: 2025-05-07T20:31:45.4337828Z scale_ub_tensor = None 2025-05-07T20:31:45.4337989Z 2025-05-07T20:31:45.4338121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4338213Z op = silu_mul_quant 2025-05-07T20:31:45.4338308Z if compiled: 
2025-05-07T20:31:45.4338415Z op = torch.compile(op) 2025-05-07T20:31:45.4338522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4338602Z 2025-05-07T20:31:45.4338695Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4338699Z 2025-05-07T20:31:45.4338804Z moe/activation_test.py:117: 2025-05-07T20:31:45.4338936Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4339041Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4339149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4339519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4339612Z return fn(*args, **kwargs) 2025-05-07T20:31:45.4340122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4340221Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4340590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4340812Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4341152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4341256Z kernel = self.compile( 2025-05-07T20:31:45.4341639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4341821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4341952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4341965Z 2025-05-07T20:31:45.4342169Z self = 2025-05-07T20:31:45.4342964Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4343472Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68940f3b00>} 2025-05-07T20:31:45.4344236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4344425Z context = 2025-05-07T20:31:45.4344430Z 2025-05-07T20:31:45.4344597Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4344870Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4344977Z module_map=module_map) 2025-05-07T20:31:45.4345256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4345375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4345461Z E ^ 2025-05-07T20:31:45.4345850Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:45.4346384Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.4346609Z     self=<...>,
2025-05-07T20:31:45.4346685Z     T=1,
2025-05-07T20:31:45.4346846Z     D=5120,
2025-05-07T20:31:45.4346930Z     scale_ub=1200.0,
2025-05-07T20:31:45.4347023Z     contiguous=False,
2025-05-07T20:31:45.4347109Z     compiled=False,
2025-05-07T20:31:45.4347184Z )
2025-05-07T20:31:45.4347411Z self = <...>
2025-05-07T20:31:45.4347579Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:31:45.4347583Z 
2025-05-07T20:31:45.4347660Z     @given(
2025-05-07T20:31:45.4347790Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.4347891Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.4348004Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.4348128Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.4348244Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.4348324Z     )
2025-05-07T20:31:45.4348569Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.4348670Z     def test_silu_mul_quant(
2025-05-07T20:31:45.4348754Z         self,
2025-05-07T20:31:45.4348830Z         T: int,
2025-05-07T20:31:45.4348906Z         D: int,
2025-05-07T20:31:45.4349013Z         scale_ub: Optional[float],
2025-05-07T20:31:45.4349169Z         contiguous: bool,
2025-05-07T20:31:45.4349259Z         compiled: bool,
2025-05-07T20:31:45.4349345Z     ) -> None:
2025-05-07T20:31:45.4349441Z         torch.manual_seed(2025)
2025-05-07T20:31:45.4349514Z 
2025-05-07T20:31:45.4349689Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4349761Z 
2025-05-07T20:31:45.4349859Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.4349982Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4350072Z         x = x_sign * x_clamp
2025-05-07T20:31:45.4350158Z         x0 = x[:, :D]
2025-05-07T20:31:45.4350237Z         x1 = x[:, D:]
2025-05-07T20:31:45.4350309Z 
2025-05-07T20:31:45.4350397Z         if contiguous:
2025-05-07T20:31:45.4350493Z             x0 = x0.contiguous()
2025-05-07T20:31:45.4350584Z             x1 = x1.contiguous()
2025-05-07T20:31:45.4350662Z 
2025-05-07T20:31:45.4350751Z         if scale_ub is not None:
2025-05-07T20:31:45.4350859Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.4351001Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.4351077Z             )
2025-05-07T20:31:45.4351159Z         else:
2025-05-07T20:31:45.4351251Z             scale_ub_tensor = None
2025-05-07T20:31:45.4351322Z 
2025-05-07T20:31:45.4351457Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.4351546Z             op = silu_mul_quant
2025-05-07T20:31:45.4351633Z             if compiled:
2025-05-07T20:31:45.4351738Z                 op = torch.compile(op)
2025-05-07T20:31:45.4351842Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4351913Z 
2025-05-07T20:31:45.4352012Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.4352021Z 
2025-05-07T20:31:45.4352119Z moe/activation_test.py:117: 
2025-05-07T20:31:45.4352248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.4352358Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.4352542Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4353049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4353146Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4353502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.4353729Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.4354068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.4354170Z     kernel = self.compile(
2025-05-07T20:31:45.4354623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.4354798Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.4354934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:31:45.4354938Z 
2025-05-07T20:31:45.4355141Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:31:45.4355925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.4356435Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f6873025a80>}
2025-05-07T20:31:45.4357194Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:31:45.4357394Z context = <...>
2025-05-07T20:31:45.4357399Z 
2025-05-07T20:31:45.4357567Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.4357835Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.4357941Z                            module_map=module_map)
2025-05-07T20:31:45.4358102Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4358207Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4358287Z E   ^
2025-05-07T20:31:45.4358645Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4358650Z 
2025-05-07T20:31:45.4359073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.4359083Z 
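For reference while reading the test source above: silu_mul_quant fuses a SiLU gate with an elementwise product and rowwise FP8 quantization, returning the quantized activations and their per-row scales. A minimal eager sketch of those semantics, inferred from the test's inputs and its (y_fp8, y_scale) unpacking; the scale convention, the eps clamp, and the use of scale_ub as a cap on the row maximum are assumptions, not FBGEMM's kernel:

from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise dynamic quantization to FP8 E4M3.
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    y_scale = row_max.clamp(min=1e-12) / fp8_max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale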
2025-05-07T20:31:45.4359187Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4372521Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4386147Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4399085Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4411917Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:45.4425220Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:45.4444514Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4457568Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4471044Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4483911Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:45.4496591Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4508874Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4508973Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4509171Z E   ^
2025-05-07T20:31:45.4509530Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4509535Z 2025-05-07T20:31:45.4509958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4509969Z 2025-05-07T20:31:45.4510073Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4510296Z self=, 2025-05-07T20:31:45.4510379Z T=1, 2025-05-07T20:31:45.4510457Z D=7168, 2025-05-07T20:31:45.4510544Z scale_ub=None, 2025-05-07T20:31:45.4510638Z contiguous=False, 2025-05-07T20:31:45.4510723Z compiled=False, 2025-05-07T20:31:45.4510795Z ) 2025-05-07T20:31:45.4511023Z self = 2025-05-07T20:31:45.4511188Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:45.4511192Z 2025-05-07T20:31:45.4511276Z @given( 2025-05-07T20:31:45.4511393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4511491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4511612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4511728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4511847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4511926Z ) 2025-05-07T20:31:45.4512171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4512269Z def test_silu_mul_quant( 2025-05-07T20:31:45.4512352Z self, 2025-05-07T20:31:45.4512429Z T: int, 2025-05-07T20:31:45.4512515Z D: int, 2025-05-07T20:31:45.4512613Z scale_ub: Optional[float], 2025-05-07T20:31:45.4512702Z contiguous: bool, 2025-05-07T20:31:45.4512794Z compiled: bool, 2025-05-07T20:31:45.4512872Z ) -> None: 2025-05-07T20:31:45.4512965Z torch.manual_seed(2025) 2025-05-07T20:31:45.4513049Z 2025-05-07T20:31:45.4513218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4513293Z 2025-05-07T20:31:45.4513391Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4513515Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4513608Z x = x_sign * x_clamp 2025-05-07T20:31:45.4513694Z x0 = x[:, :D] 2025-05-07T20:31:45.4513774Z x1 = x[:, D:] 2025-05-07T20:31:45.4513845Z 2025-05-07T20:31:45.4513937Z if contiguous: 2025-05-07T20:31:45.4514114Z x0 = x0.contiguous() 2025-05-07T20:31:45.4514210Z x1 = x1.contiguous() 2025-05-07T20:31:45.4514282Z 2025-05-07T20:31:45.4514371Z if scale_ub is not None: 2025-05-07T20:31:45.4514481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4514614Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4514689Z ) 2025-05-07T20:31:45.4514772Z else: 2025-05-07T20:31:45.4514863Z scale_ub_tensor = None 2025-05-07T20:31:45.4514936Z 2025-05-07T20:31:45.4515072Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4515161Z op = silu_mul_quant 2025-05-07T20:31:45.4515246Z if compiled: 2025-05-07T20:31:45.4515432Z op = torch.compile(op) 2025-05-07T20:31:45.4515536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4515615Z 2025-05-07T20:31:45.4515705Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4515710Z 2025-05-07T20:31:45.4515812Z moe/activation_test.py:117: 2025-05-07T20:31:45.4515946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4516047Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4516144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4516649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4516745Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4517106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4517326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4517670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4517769Z kernel = self.compile( 2025-05-07T20:31:45.4518154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4518325Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4518456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4518461Z 2025-05-07T20:31:45.4518661Z self = 2025-05-07T20:31:45.4519445Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4519947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723fdf80>} 2025-05-07T20:31:45.4520714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4520902Z context = 2025-05-07T20:31:45.4520908Z 2025-05-07T20:31:45.4521071Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4521338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4521444Z module_map=module_map) 2025-05-07T20:31:45.4521615Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4521714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4521790Z E ^ 2025-05-07T20:31:45.4522158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4522163Z 2025-05-07T20:31:45.4522663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4522668Z 2025-05-07T20:31:45.4522772Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4523002Z self=, 2025-05-07T20:31:45.4523081Z T=2048, 2025-05-07T20:31:45.4523164Z D=7168, 2025-05-07T20:31:45.4523245Z scale_ub=None, 2025-05-07T20:31:45.4523333Z contiguous=False, 2025-05-07T20:31:45.4523421Z compiled=True, 2025-05-07T20:31:45.4523494Z ) 2025-05-07T20:31:45.4523713Z self = 2025-05-07T20:31:45.4523893Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4523977Z 2025-05-07T20:31:45.4524057Z @given( 2025-05-07T20:31:45.4524174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4524281Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4524397Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4524526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4524639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4524713Z ) 2025-05-07T20:31:45.4524970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4525068Z def test_silu_mul_quant( 2025-05-07T20:31:45.4525168Z self, 2025-05-07T20:31:45.4525261Z T: int, 2025-05-07T20:31:45.4525356Z D: int, 2025-05-07T20:31:45.4525455Z scale_ub: Optional[float], 2025-05-07T20:31:45.4525550Z contiguous: bool, 2025-05-07T20:31:45.4525636Z compiled: bool, 2025-05-07T20:31:45.4525714Z ) -> None: 2025-05-07T20:31:45.4525817Z torch.manual_seed(2025) 2025-05-07T20:31:45.4525895Z 2025-05-07T20:31:45.4526070Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4526144Z 2025-05-07T20:31:45.4526238Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4526373Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4526460Z x = x_sign * x_clamp 2025-05-07T20:31:45.4526540Z x0 = x[:, :D] 2025-05-07T20:31:45.4526628Z x1 = x[:, D:] 2025-05-07T20:31:45.4526704Z 2025-05-07T20:31:45.4526789Z if contiguous: 2025-05-07T20:31:45.4526889Z x0 = x0.contiguous() 2025-05-07T20:31:45.4526979Z x1 = x1.contiguous() 2025-05-07T20:31:45.4527050Z 2025-05-07T20:31:45.4527149Z if scale_ub is not None: 2025-05-07T20:31:45.4527256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4527398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4527476Z ) 2025-05-07T20:31:45.4527559Z else: 2025-05-07T20:31:45.4527659Z scale_ub_tensor = None 2025-05-07T20:31:45.4527731Z 2025-05-07T20:31:45.4527858Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4527953Z op = silu_mul_quant 2025-05-07T20:31:45.4528041Z if compiled: 2025-05-07T20:31:45.4528477Z op = torch.compile(op) 2025-05-07T20:31:45.4528651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4528759Z 2025-05-07T20:31:45.4528860Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4528865Z 2025-05-07T20:31:45.4528969Z moe/activation_test.py:117: 2025-05-07T20:31:45.4529099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4529204Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4529306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4529673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4529777Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4530271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4530595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4530959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4531181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4531525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4531618Z kernel = self.compile( 2025-05-07T20:31:45.4531998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4532179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4532467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4532472Z 2025-05-07T20:31:45.4532686Z self = 2025-05-07T20:31:45.4533470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4533971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68723ff420>} 2025-05-07T20:31:45.4534728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4534916Z context = 2025-05-07T20:31:45.4534927Z 2025-05-07T20:31:45.4535100Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4535366Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4535489Z module_map=module_map) 2025-05-07T20:31:45.4535681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4535791Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4535874Z E ^ 2025-05-07T20:31:45.4536230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4536235Z 2025-05-07T20:31:45.4536649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4536654Z 2025-05-07T20:31:45.4536762Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4536987Z self=, 2025-05-07T20:31:45.4537070Z T=4096, 2025-05-07T20:31:45.4537153Z D=7168, 2025-05-07T20:31:45.4537238Z scale_ub=None, 2025-05-07T20:31:45.4537330Z contiguous=False, 2025-05-07T20:31:45.4537412Z compiled=True, 2025-05-07T20:31:45.4537490Z ) 2025-05-07T20:31:45.4537714Z self = 2025-05-07T20:31:45.4537885Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4537889Z 2025-05-07T20:31:45.4537966Z @given( 2025-05-07T20:31:45.4538089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4538188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4538303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4538426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4538538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4538617Z ) 2025-05-07T20:31:45.4538866Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4538960Z def test_silu_mul_quant( 2025-05-07T20:31:45.4539042Z self, 2025-05-07T20:31:45.4539119Z T: int, 2025-05-07T20:31:45.4539285Z D: int, 2025-05-07T20:31:45.4539392Z scale_ub: Optional[float], 2025-05-07T20:31:45.4539480Z contiguous: bool, 2025-05-07T20:31:45.4539565Z compiled: bool, 2025-05-07T20:31:45.4539651Z ) -> None: 2025-05-07T20:31:45.4539745Z torch.manual_seed(2025) 2025-05-07T20:31:45.4539820Z 2025-05-07T20:31:45.4539994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4540069Z 2025-05-07T20:31:45.4540171Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4540295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4540383Z x = x_sign * x_clamp 2025-05-07T20:31:45.4540471Z x0 = x[:, :D] 2025-05-07T20:31:45.4540632Z x1 = x[:, D:] 2025-05-07T20:31:45.4540704Z 2025-05-07T20:31:45.4540796Z if contiguous: 2025-05-07T20:31:45.4540888Z x0 = x0.contiguous() 2025-05-07T20:31:45.4540977Z x1 = x1.contiguous() 2025-05-07T20:31:45.4541057Z 2025-05-07T20:31:45.4541153Z if scale_ub is not None: 2025-05-07T20:31:45.4541259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4541400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4541475Z ) 2025-05-07T20:31:45.4541552Z else: 2025-05-07T20:31:45.4541652Z scale_ub_tensor = None 2025-05-07T20:31:45.4541724Z 2025-05-07T20:31:45.4541860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4541951Z op = silu_mul_quant 2025-05-07T20:31:45.4542035Z if compiled: 2025-05-07T20:31:45.4542142Z op = torch.compile(op) 2025-05-07T20:31:45.4542248Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4542326Z 2025-05-07T20:31:45.4542424Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4542428Z 2025-05-07T20:31:45.4542526Z moe/activation_test.py:117: 2025-05-07T20:31:45.4542660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4542770Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4542870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4543242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4543334Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4543827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4543929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4544284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4544509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4544851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4544944Z kernel = self.compile( 2025-05-07T20:31:45.4545337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4545510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4545637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4545641Z 2025-05-07T20:31:45.4545848Z self = 2025-05-07T20:31:45.4546625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4547139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e8680>} 2025-05-07T20:31:45.4547979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4548175Z context = 2025-05-07T20:31:45.4548180Z 2025-05-07T20:31:45.4548346Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4548607Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4548719Z module_map=module_map) 2025-05-07T20:31:45.4548879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4548977Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4549317Z E ^ 2025-05-07T20:31:45.4549676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4549681Z 2025-05-07T20:31:45.4550107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4550111Z 2025-05-07T20:31:45.4550214Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4550439Z self=, 2025-05-07T20:31:45.4550522Z T=16384, 2025-05-07T20:31:45.4550599Z D=5120, 2025-05-07T20:31:45.4550683Z scale_ub=1200.0, 2025-05-07T20:31:45.4550775Z contiguous=False, 2025-05-07T20:31:45.4550859Z compiled=False, 2025-05-07T20:31:45.4550938Z ) 2025-05-07T20:31:45.4551156Z self = 2025-05-07T20:31:45.4551335Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:45.4551345Z 2025-05-07T20:31:45.4551427Z @given( 2025-05-07T20:31:45.4551546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4551645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4551772Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4551888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4551999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4552078Z ) 2025-05-07T20:31:45.4552322Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4552422Z def test_silu_mul_quant( 2025-05-07T20:31:45.4552498Z self, 2025-05-07T20:31:45.4552574Z T: int, 2025-05-07T20:31:45.4552655Z D: int, 2025-05-07T20:31:45.4552752Z scale_ub: Optional[float], 2025-05-07T20:31:45.4552842Z contiguous: bool, 2025-05-07T20:31:45.4552933Z compiled: bool, 2025-05-07T20:31:45.4553016Z ) -> None: 2025-05-07T20:31:45.4553110Z torch.manual_seed(2025) 2025-05-07T20:31:45.4553189Z 2025-05-07T20:31:45.4553358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4553431Z 2025-05-07T20:31:45.4553542Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4558144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4558255Z x = x_sign * x_clamp 2025-05-07T20:31:45.4558343Z x0 = x[:, :D] 2025-05-07T20:31:45.4558436Z x1 = x[:, D:] 2025-05-07T20:31:45.4558511Z 2025-05-07T20:31:45.4558599Z if contiguous: 2025-05-07T20:31:45.4558701Z x0 = x0.contiguous() 2025-05-07T20:31:45.4558793Z x1 = x1.contiguous() 2025-05-07T20:31:45.4558866Z 2025-05-07T20:31:45.4558967Z if scale_ub is not None: 2025-05-07T20:31:45.4559078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4559228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4559314Z ) 2025-05-07T20:31:45.4559393Z else: 2025-05-07T20:31:45.4559496Z scale_ub_tensor = None 2025-05-07T20:31:45.4559570Z 2025-05-07T20:31:45.4559818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4559920Z op = silu_mul_quant 2025-05-07T20:31:45.4560008Z if compiled: 2025-05-07T20:31:45.4560111Z op = torch.compile(op) 2025-05-07T20:31:45.4560228Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4560302Z 2025-05-07T20:31:45.4560395Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4560409Z 2025-05-07T20:31:45.4560511Z moe/activation_test.py:117: 2025-05-07T20:31:45.4560642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4560752Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4560854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4561362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:45.4561546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4561911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4562142Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4562487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4562583Z kernel = self.compile( 2025-05-07T20:31:45.4562977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4563151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4563280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4563285Z 2025-05-07T20:31:45.4563508Z self = 2025-05-07T20:31:45.4564293Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4564806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727e94e0>} 2025-05-07T20:31:45.4565583Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4565804Z context = 2025-05-07T20:31:45.4565808Z 2025-05-07T20:31:45.4565974Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4566242Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4566360Z module_map=module_map) 2025-05-07T20:31:45.4566530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4566632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4566719Z E ^ 2025-05-07T20:31:45.4567081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4567085Z 2025-05-07T20:31:45.4567509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4567513Z 2025-05-07T20:31:45.4567619Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4567843Z self=, 2025-05-07T20:31:45.4567931Z T=16384, 2025-05-07T20:31:45.4568010Z D=5120, 2025-05-07T20:31:45.4568100Z scale_ub=1200.0, 2025-05-07T20:31:45.4568195Z contiguous=True, 2025-05-07T20:31:45.4568279Z compiled=True, 2025-05-07T20:31:45.4568359Z ) 2025-05-07T20:31:45.4568666Z self = 2025-05-07T20:31:45.4568845Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4568850Z 2025-05-07T20:31:45.4568936Z @given( 2025-05-07T20:31:45.4569057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4569158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4569280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4569398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4569513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4569594Z ) 2025-05-07T20:31:45.4569840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4570056Z def test_silu_mul_quant( 2025-05-07T20:31:45.4570139Z self, 2025-05-07T20:31:45.4570218Z T: int, 2025-05-07T20:31:45.4570302Z D: int, 2025-05-07T20:31:45.4570399Z scale_ub: Optional[float], 2025-05-07T20:31:45.4570495Z contiguous: bool, 2025-05-07T20:31:45.4570588Z compiled: bool, 2025-05-07T20:31:45.4570667Z ) -> None: 2025-05-07T20:31:45.4570764Z torch.manual_seed(2025) 2025-05-07T20:31:45.4570846Z 2025-05-07T20:31:45.4571017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4571097Z 2025-05-07T20:31:45.4571197Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4571323Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4571422Z x = x_sign * x_clamp 2025-05-07T20:31:45.4571508Z x0 = x[:, :D] 2025-05-07T20:31:45.4571588Z x1 = x[:, D:] 2025-05-07T20:31:45.4571668Z 2025-05-07T20:31:45.4571753Z if contiguous: 2025-05-07T20:31:45.4571852Z x0 = x0.contiguous() 2025-05-07T20:31:45.4571948Z x1 = x1.contiguous() 2025-05-07T20:31:45.4572020Z 2025-05-07T20:31:45.4572112Z if scale_ub is not None: 2025-05-07T20:31:45.4572225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4572366Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4572443Z ) 2025-05-07T20:31:45.4572528Z else: 2025-05-07T20:31:45.4572622Z scale_ub_tensor = None 2025-05-07T20:31:45.4572708Z 2025-05-07T20:31:45.4572838Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4572929Z op = silu_mul_quant 2025-05-07T20:31:45.4573027Z if compiled: 2025-05-07T20:31:45.4573130Z op = torch.compile(op) 2025-05-07T20:31:45.4573239Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4573323Z 2025-05-07T20:31:45.4573419Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4573423Z 2025-05-07T20:31:45.4573531Z moe/activation_test.py:117: 2025-05-07T20:31:45.4573672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4573775Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4573880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4574260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4574354Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4574866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4574966Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4575326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4575559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4575903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4576011Z kernel = self.compile( 2025-05-07T20:31:45.4576485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4576662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4576803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4576807Z 2025-05-07T20:31:45.4577012Z self = 2025-05-07T20:31:45.4577804Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4578310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727ea8e0>} 2025-05-07T20:31:45.4579144Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4579343Z context = 2025-05-07T20:31:45.4579348Z 2025-05-07T20:31:45.4579513Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4579784Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4579893Z module_map=module_map) 2025-05-07T20:31:45.4580056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4580164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4580243Z E ^ 2025-05-07T20:31:45.4580601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4580623Z 2025-05-07T20:31:45.4581042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4581050Z 2025-05-07T20:31:45.4581159Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4581393Z self=, 2025-05-07T20:31:45.4581471Z T=16384, 2025-05-07T20:31:45.4581548Z D=5120, 2025-05-07T20:31:45.4581638Z scale_ub=None, 2025-05-07T20:31:45.4581730Z contiguous=False, 2025-05-07T20:31:45.4581814Z compiled=True, 2025-05-07T20:31:45.4581895Z ) 2025-05-07T20:31:45.4582115Z self = 2025-05-07T20:31:45.4582299Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4582303Z 2025-05-07T20:31:45.4582381Z @given( 2025-05-07T20:31:45.4582505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4582611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4582727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4582849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4582971Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4583046Z ) 2025-05-07T20:31:45.4583292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4583395Z def test_silu_mul_quant( 2025-05-07T20:31:45.4583473Z self, 2025-05-07T20:31:45.4583557Z T: int, 2025-05-07T20:31:45.4583634Z D: int, 2025-05-07T20:31:45.4583734Z scale_ub: Optional[float], 2025-05-07T20:31:45.4583831Z contiguous: bool, 2025-05-07T20:31:45.4583917Z compiled: bool, 2025-05-07T20:31:45.4583999Z ) -> None: 2025-05-07T20:31:45.4584104Z torch.manual_seed(2025) 2025-05-07T20:31:45.4584183Z 2025-05-07T20:31:45.4584352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4584434Z 2025-05-07T20:31:45.4584529Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4584656Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4584848Z x = x_sign * x_clamp 2025-05-07T20:31:45.4584951Z x0 = x[:, :D] 2025-05-07T20:31:45.4585061Z x1 = x[:, D:] 2025-05-07T20:31:45.4585153Z 2025-05-07T20:31:45.4585258Z if contiguous: 2025-05-07T20:31:45.4585381Z x0 = x0.contiguous() 2025-05-07T20:31:45.4585493Z x1 = x1.contiguous() 2025-05-07T20:31:45.4585584Z 2025-05-07T20:31:45.4585702Z if scale_ub is not None: 2025-05-07T20:31:45.4585834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4586003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4586107Z ) 2025-05-07T20:31:45.4586202Z else: 2025-05-07T20:31:45.4586413Z scale_ub_tensor = None 2025-05-07T20:31:45.4586513Z 2025-05-07T20:31:45.4586673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4586794Z op = silu_mul_quant 2025-05-07T20:31:45.4586900Z if compiled: 2025-05-07T20:31:45.4587030Z op = torch.compile(op) 2025-05-07T20:31:45.4587167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4587257Z 2025-05-07T20:31:45.4587364Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4587369Z 2025-05-07T20:31:45.4587474Z moe/activation_test.py:117: 2025-05-07T20:31:45.4587604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4587708Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4587816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4588184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4588284Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4588785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4588883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4589392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4589614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4589954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4590056Z kernel = self.compile( 2025-05-07T20:31:45.4590442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4590622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4590751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4590762Z 2025-05-07T20:31:45.4590966Z self = 2025-05-07T20:31:45.4591761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4592266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f68727eaf20>} 2025-05-07T20:31:45.4593027Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4593217Z context = 2025-05-07T20:31:45.4593222Z 2025-05-07T20:31:45.4593397Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4593659Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4593854Z module_map=module_map) 2025-05-07T20:31:45.4594025Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4594127Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4594204Z E ^ 2025-05-07T20:31:45.4594573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4594577Z 2025-05-07T20:31:45.4595004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4595010Z 2025-05-07T20:31:45.4595124Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4595351Z self=, 2025-05-07T20:31:45.4595516Z T=2048, 2025-05-07T20:31:45.4595609Z D=5120, 2025-05-07T20:31:45.4595711Z scale_ub=None, 2025-05-07T20:31:45.4595807Z contiguous=False, 2025-05-07T20:31:45.4595917Z compiled=True, 2025-05-07T20:31:45.4595991Z ) 2025-05-07T20:31:45.4596219Z self = 2025-05-07T20:31:45.4596403Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:45.4596407Z 2025-05-07T20:31:45.4596487Z @given( 2025-05-07T20:31:45.4596613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4596713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4596832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4596962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4597076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4597152Z ) 2025-05-07T20:31:45.4597404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4597505Z def test_silu_mul_quant( 2025-05-07T20:31:45.4597585Z self, 2025-05-07T20:31:45.4597668Z T: int, 2025-05-07T20:31:45.4597745Z D: int, 2025-05-07T20:31:45.4597858Z scale_ub: Optional[float], 2025-05-07T20:31:45.4597948Z contiguous: bool, 2025-05-07T20:31:45.4598035Z compiled: bool, 2025-05-07T20:31:45.4598124Z ) -> None: 2025-05-07T20:31:45.4598219Z torch.manual_seed(2025) 2025-05-07T20:31:45.4598298Z 2025-05-07T20:31:45.4598466Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4598542Z 2025-05-07T20:31:45.4598643Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4598768Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4598860Z x = x_sign * x_clamp 2025-05-07T20:31:45.4598948Z x0 = x[:, :D] 2025-05-07T20:31:45.4599028Z x1 = x[:, D:] 2025-05-07T20:31:45.4599104Z 2025-05-07T20:31:45.4599198Z if contiguous: 2025-05-07T20:31:45.4599289Z x0 = x0.contiguous() 2025-05-07T20:31:45.4599378Z x1 = x1.contiguous() 2025-05-07T20:31:45.4599460Z 2025-05-07T20:31:45.4599550Z if scale_ub is not None: 2025-05-07T20:31:45.4599665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4599800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4599875Z ) 2025-05-07T20:31:45.4599957Z else: 2025-05-07T20:31:45.4600052Z scale_ub_tensor = None 2025-05-07T20:31:45.4600126Z 2025-05-07T20:31:45.4600262Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4600352Z op = silu_mul_quant 2025-05-07T20:31:45.4600436Z if compiled: 2025-05-07T20:31:45.4600541Z op = torch.compile(op) 2025-05-07T20:31:45.4600647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4600719Z 2025-05-07T20:31:45.4600819Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4600824Z 2025-05-07T20:31:45.4600922Z moe/activation_test.py:117: 2025-05-07T20:31:45.4601057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4601280Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4601380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4601756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4601848Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4602340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4602442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4602801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4603028Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4603444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4603540Z kernel = self.compile( 2025-05-07T20:31:45.4603933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4604104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4604232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4604242Z 2025-05-07T20:31:45.4604445Z self = 2025-05-07T20:31:45.4605227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4605741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f58d60>} 2025-05-07T20:31:45.4606502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4606695Z context = 2025-05-07T20:31:45.4606700Z 2025-05-07T20:31:45.4606862Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4607122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4607235Z module_map=module_map) 2025-05-07T20:31:45.4607399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4607503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4607580Z E ^ 2025-05-07T20:31:45.4607942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4607946Z 2025-05-07T20:31:45.4608374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4608378Z 2025-05-07T20:31:45.4608482Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4608705Z self=, 2025-05-07T20:31:45.4608789Z T=2048, 2025-05-07T20:31:45.4608865Z D=5120, 2025-05-07T20:31:45.4608953Z scale_ub=1200.0, 2025-05-07T20:31:45.4609037Z contiguous=False, 2025-05-07T20:31:45.4609118Z compiled=True, 2025-05-07T20:31:45.4609201Z ) 2025-05-07T20:31:45.4609419Z self = 2025-05-07T20:31:45.4609591Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:31:45.4609600Z 2025-05-07T20:31:45.4609685Z @given( 2025-05-07T20:31:45.4609803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4609903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4610177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4610296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4610415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4610488Z ) 2025-05-07T20:31:45.4610734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4610834Z def test_silu_mul_quant( 2025-05-07T20:31:45.4610911Z self, 2025-05-07T20:31:45.4610989Z T: int, 2025-05-07T20:31:45.4611074Z D: int, 2025-05-07T20:31:45.4611171Z scale_ub: Optional[float], 2025-05-07T20:31:45.4611259Z contiguous: bool, 2025-05-07T20:31:45.4611350Z compiled: bool, 2025-05-07T20:31:45.4611428Z ) -> None: 2025-05-07T20:31:45.4611607Z torch.manual_seed(2025) 2025-05-07T20:31:45.4611686Z 2025-05-07T20:31:45.4611852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4611932Z 2025-05-07T20:31:45.4612024Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4612155Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4612253Z x = x_sign * x_clamp 2025-05-07T20:31:45.4612333Z x0 = x[:, :D] 2025-05-07T20:31:45.4612414Z x1 = x[:, D:] 2025-05-07T20:31:45.4612493Z 2025-05-07T20:31:45.4612577Z if contiguous: 2025-05-07T20:31:45.4612668Z x0 = x0.contiguous() 2025-05-07T20:31:45.4612765Z x1 = x1.contiguous() 2025-05-07T20:31:45.4612837Z 2025-05-07T20:31:45.4612926Z if scale_ub is not None: 2025-05-07T20:31:45.4613037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4613172Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4613253Z ) 2025-05-07T20:31:45.4613336Z else: 2025-05-07T20:31:45.4613430Z scale_ub_tensor = None 2025-05-07T20:31:45.4613508Z 2025-05-07T20:31:45.4613636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4613729Z op = silu_mul_quant 2025-05-07T20:31:45.4613823Z if compiled: 2025-05-07T20:31:45.4613925Z op = torch.compile(op) 2025-05-07T20:31:45.4614030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4614108Z 2025-05-07T20:31:45.4614199Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4614204Z 2025-05-07T20:31:45.4614301Z moe/activation_test.py:117: 2025-05-07T20:31:45.4614438Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4614540Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4614645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4615017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4615135Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4615665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4615768Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4616124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4616355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4616693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4616794Z kernel = self.compile( 2025-05-07T20:31:45.4617174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4617345Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4617483Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4617487Z 2025-05-07T20:31:45.4617690Z self = 2025-05-07T20:31:45.4618569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4619075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f59760>} 2025-05-07T20:31:45.4619825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4620017Z context = 2025-05-07T20:31:45.4620097Z 2025-05-07T20:31:45.4620261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4620529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4620638Z module_map=module_map) 2025-05-07T20:31:45.4620817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4620917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4620995Z E ^ 2025-05-07T20:31:45.4621359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4621364Z 2025-05-07T20:31:45.4621782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4621786Z 2025-05-07T20:31:45.4621897Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4622121Z self=, 2025-05-07T20:31:45.4622203Z T=4096, 2025-05-07T20:31:45.4622287Z D=5120, 2025-05-07T20:31:45.4622370Z scale_ub=1200.0, 2025-05-07T20:31:45.4622455Z contiguous=True, 2025-05-07T20:31:45.4622548Z compiled=True, 2025-05-07T20:31:45.4622620Z ) 2025-05-07T20:31:45.4622843Z self = 2025-05-07T20:31:45.4623013Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:45.4623017Z 2025-05-07T20:31:45.4623095Z @given( 2025-05-07T20:31:45.4623217Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4623317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4623430Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4623558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4623671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4623750Z ) 2025-05-07T20:31:45.4624001Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4624095Z def test_silu_mul_quant( 2025-05-07T20:31:45.4624181Z self, 2025-05-07T20:31:45.4624262Z T: int, 2025-05-07T20:31:45.4624342Z D: int, 2025-05-07T20:31:45.4624446Z scale_ub: Optional[float], 2025-05-07T20:31:45.4624540Z contiguous: bool, 2025-05-07T20:31:45.4624625Z compiled: bool, 2025-05-07T20:31:45.4624720Z ) -> None: 2025-05-07T20:31:45.4624838Z torch.manual_seed(2025) 2025-05-07T20:31:45.4624931Z 2025-05-07T20:31:45.4625146Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4625237Z 2025-05-07T20:31:45.4625350Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4625509Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4625620Z x = x_sign * x_clamp 2025-05-07T20:31:45.4625724Z x0 = x[:, :D] 2025-05-07T20:31:45.4625827Z x1 = x[:, D:] 2025-05-07T20:31:45.4625919Z 2025-05-07T20:31:45.4626028Z if contiguous: 2025-05-07T20:31:45.4626141Z x0 = x0.contiguous() 2025-05-07T20:31:45.4626251Z x1 = x1.contiguous() 2025-05-07T20:31:45.4626345Z 2025-05-07T20:31:45.4626566Z if scale_ub is not None: 2025-05-07T20:31:45.4626698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4626847Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4626921Z ) 2025-05-07T20:31:45.4626998Z else: 2025-05-07T20:31:45.4627095Z scale_ub_tensor = None 2025-05-07T20:31:45.4627167Z 2025-05-07T20:31:45.4627297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4627392Z op = silu_mul_quant 2025-05-07T20:31:45.4627476Z if compiled: 2025-05-07T20:31:45.4627580Z op = torch.compile(op) 2025-05-07T20:31:45.4627685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4627834Z 2025-05-07T20:31:45.4627931Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4627936Z 2025-05-07T20:31:45.4628032Z moe/activation_test.py:117: 2025-05-07T20:31:45.4628476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4628641Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4628781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4629210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:45.4629303Z return fn(*args, **kwargs) 
2025-05-07T20:31:45.4629802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4629905Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4630266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:45.4630496Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:45.4630844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:45.4630942Z     kernel = self.compile(
2025-05-07T20:31:45.4631331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:45.4631505Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:45.4631633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.4631848Z self =
2025-05-07T20:31:45.4632634Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:45.4633147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6871f5a980>}
2025-05-07T20:31:45.4633903Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:45.4634092Z context =
2025-05-07T20:31:45.4634265Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:45.4634528Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:45.4634640Z                           module_map=module_map)
2025-05-07T20:31:45.4634803Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4634906Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4634995Z E       ^
2025-05-07T20:31:45.4635399Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4636111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:45.4636222Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:45.4636447Z     self=,
2025-05-07T20:31:45.4636532Z     T=128,
2025-05-07T20:31:45.4636607Z     D=5120,
2025-05-07T20:31:45.4636689Z     scale_ub=1200.0,
2025-05-07T20:31:45.4636786Z     contiguous=False,
2025-05-07T20:31:45.4636869Z     compiled=True,
2025-05-07T20:31:45.4636942Z )
2025-05-07T20:31:45.4637340Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True
2025-05-07T20:31:45.4637559Z     @given(
2025-05-07T20:31:45.4637680Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:45.4637779Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:45.4637909Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:45.4638026Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:45.4638138Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:45.4638219Z     )
2025-05-07T20:31:45.4638464Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:45.4638566Z     def test_silu_mul_quant(
2025-05-07T20:31:45.4638642Z         self,
2025-05-07T20:31:45.4638718Z         T: int,
2025-05-07T20:31:45.4638803Z         D: int,
2025-05-07T20:31:45.4638901Z         scale_ub: Optional[float],
2025-05-07T20:31:45.4638992Z         contiguous: bool,
2025-05-07T20:31:45.4639083Z         compiled: bool,
2025-05-07T20:31:45.4639168Z     ) -> None:
2025-05-07T20:31:45.4639263Z         torch.manual_seed(2025)
2025-05-07T20:31:45.4639510Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4639686Z         x_sign = torch.sign(x)
2025-05-07T20:31:45.4639810Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4639900Z         x = x_sign * x_clamp
2025-05-07T20:31:45.4639988Z         x0 = x[:, :D]
2025-05-07T20:31:45.4640068Z         x1 = x[:, D:]
2025-05-07T20:31:45.4640232Z         if contiguous:
2025-05-07T20:31:45.4640324Z             x0 = x0.contiguous()
2025-05-07T20:31:45.4640419Z             x1 = x1.contiguous()
2025-05-07T20:31:45.4640580Z         if scale_ub is not None:
2025-05-07T20:31:45.4640693Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:45.4640827Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:45.4640907Z             )
2025-05-07T20:31:45.4640989Z         else:
2025-05-07T20:31:45.4641084Z             scale_ub_tensor = None
2025-05-07T20:31:45.4641291Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:45.4641381Z             op = silu_mul_quant
2025-05-07T20:31:45.4641473Z             if compiled:
2025-05-07T20:31:45.4641571Z                 op = torch.compile(op)
2025-05-07T20:31:45.4641675Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4641842Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:45.4641943Z moe/activation_test.py:117:
2025-05-07T20:31:45.4642079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:45.4642179Z moe/activation_test.py:115: in fn
2025-05-07T20:31:45.4642278Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:45.4642652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:31:45.4642749Z     return fn(*args, **kwargs)
2025-05-07T20:31:45.4643339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:45.4643439Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:45.4648454Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:45.4648560Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:31:45.4648636Z E   ^
2025-05-07T20:31:45.4648992Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4649420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
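Note: the test runs Hypothesis with verbosity=Verbosity.verbose, so it logs a "Trying example: ..." block for every parameter combination it generates or shrinks and keeps drawing examples after a failure, which is why the identical error repeats for each combination below. While debugging, one can pin a single failing combination; a minimal stand-alone sketch, using only the strategies visible in the test source above (test_shapes_only is a hypothetical test, not FBGEMM's):

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=128, D=5120)  # force the first failing shape seen in this log
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_shapes_only(T: int, D: int) -> None:
        assert T >= 1 and D in (5120, 7168)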
2025-05-07T20:31:45.4649526Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4662285Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4663481Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:31:45.4675189Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4676304Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:45.4692742Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4693844Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4706234Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4707408Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:45.4719860Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:45.4720962Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:45.4734001Z E   triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
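Note: every CompilationError in this run is the same architecture mismatch: Triton rejects the fp8e4nv (e4m3) dtype on this runner's GPU. The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), and fp8e4nv is generally only available from (8, 9) onward; that cutoff is an assumption based on the error text, not something stated in the test source. A minimal sketch of a capability guard that would skip these examples instead of failing them:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Assumed requirement: fp8e4nv needs sm_89 or newer; the A10G is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage: apply to a test method or TestCase class.
    skip_if_no_fp8 = unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv support")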
2025-05-07T20:31:45.4735077Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4738589Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4740499Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4740632Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4740744Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4744299Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4746146Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4746284Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4746393Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:45.4749709Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:45.4751593Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4751727Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:31:45.4751836Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:45.4755267Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:45.4757126Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4757257Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:31:45.4757369Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:45.4760728Z >       x_sign = torch.sign(x)
2025-05-07T20:31:45.4762506Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:31:45.4762633Z moe/activation_test.py:94: OutOfMemoryError
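Note: the OutOfMemoryError examples above are allocation failures rather than kernel bugs. A single 16384 x 14336 bf16 input is already 448 MiB, and after many failed examples the process is holding roughly 21.6 GiB of the A10G's 22.07 GiB, so even 56 MiB requests fail. The error text itself suggests expandable segments; a minimal sketch of that plus an explicit per-example cleanup, assuming the variable is set before the process first touches CUDA (the release_cuda_memory helper is hypothetical, not part of the test):

    import gc
    import os

    # Per the error text: must be in the environment before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Hypothetical cleanup between Hypothesis examples: collect dead tensors,
        # then return cached blocks to the driver so the next example starts fresh.
        gc.collect()
        torch.cuda.empty_cache()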
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:45.4769886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:45.4770117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:45.4770458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:45.4770565Z kernel = self.compile( 2025-05-07T20:31:45.4770949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:45.4771122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:45.4771258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4771263Z 2025-05-07T20:31:45.4771466Z self = 2025-05-07T20:31:45.4772252Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:45.4772760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f687191dbc0>} 2025-05-07T20:31:45.4773601Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:45.4773799Z context = 2025-05-07T20:31:45.4773804Z 2025-05-07T20:31:45.4773967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:45.4774238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:45.4774344Z module_map=module_map) 2025-05-07T20:31:45.4774503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:45.4774608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:45.4774767Z E ^ 2025-05-07T20:31:45.4775134Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:45.4775138Z 2025-05-07T20:31:45.4775588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:45.4775593Z 2025-05-07T20:31:45.4775711Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:45.4775951Z self=, 2025-05-07T20:31:45.4776028Z T=128, 2025-05-07T20:31:45.4776104Z D=5120, 2025-05-07T20:31:45.4776192Z scale_ub=None, 2025-05-07T20:31:45.4776277Z contiguous=True, 2025-05-07T20:31:45.4776369Z compiled=False, 2025-05-07T20:31:45.4776441Z ) 2025-05-07T20:31:45.4776659Z self = 2025-05-07T20:31:45.4776835Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:31:45.4776845Z 2025-05-07T20:31:45.4776921Z @given( 2025-05-07T20:31:45.4777038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:45.4777144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:45.4777264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:45.4777383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:45.4777503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:45.4777575Z ) 2025-05-07T20:31:45.4777827Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:45.4777920Z def test_silu_mul_quant( 2025-05-07T20:31:45.4777997Z self, 2025-05-07T20:31:45.4778081Z T: int, 2025-05-07T20:31:45.4778157Z D: int, 2025-05-07T20:31:45.4778255Z scale_ub: Optional[float], 2025-05-07T20:31:45.4778350Z contiguous: bool, 2025-05-07T20:31:45.4778434Z compiled: bool, 2025-05-07T20:31:45.4778516Z ) -> None: 2025-05-07T20:31:45.4778615Z torch.manual_seed(2025) 2025-05-07T20:31:45.4778688Z 2025-05-07T20:31:45.4778857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:45.4778936Z 2025-05-07T20:31:45.4779032Z x_sign = torch.sign(x) 2025-05-07T20:31:45.4779163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:45.4779252Z x = x_sign * x_clamp 2025-05-07T20:31:45.4779333Z x0 = x[:, :D] 2025-05-07T20:31:45.4779422Z x1 = x[:, D:] 2025-05-07T20:31:45.4779496Z 2025-05-07T20:31:45.4779580Z if contiguous: 2025-05-07T20:31:45.4779679Z x0 = x0.contiguous() 2025-05-07T20:31:45.4779767Z x1 = x1.contiguous() 2025-05-07T20:31:45.4779840Z 2025-05-07T20:31:45.4779937Z if scale_ub is not None: 2025-05-07T20:31:45.4780042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:45.4780177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:45.4780264Z ) 2025-05-07T20:31:45.4780340Z else: 2025-05-07T20:31:45.4780434Z scale_ub_tensor = None 2025-05-07T20:31:45.4780513Z 2025-05-07T20:31:45.4780728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:45.4780827Z op = silu_mul_quant 2025-05-07T20:31:45.4780912Z if compiled: 2025-05-07T20:31:45.4781012Z op = torch.compile(op) 2025-05-07T20:31:45.4781124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4781195Z 2025-05-07T20:31:45.4781286Z > y_fp8, y_scale = fn() 2025-05-07T20:31:45.4781290Z 2025-05-07T20:31:45.4781394Z moe/activation_test.py:117: 2025-05-07T20:31:45.4781523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:45.4781624Z moe/activation_test.py:115: in fn 2025-05-07T20:31:45.4781731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:45.4782230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:45.4782414Z 
_fbgemm_silu_mul_quant[grid](
[... Triton JIT/compile traceback identical to the occurrence above: jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False
[... test body and failure identical to the first listing above: fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
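A note on the recurring CompilationError: it is an architecture mismatch, not a bug in the test body. Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) requires an NVIDIA GPU with compute capability 8.9 or newer, while the GPU in this job (22.07 GiB reported, consistent with a pre-sm_89 part) only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these examples on unsupported hardware; the helper name supports_fp8e4nv and the skip message are illustrative, not taken from the test suite:

    # Sketch: skip fp8e4nv-dependent tests on GPUs below sm_89 (Ada/Hopper).
    # `supports_fp8e4nv` is an illustrative helper, not part of FBGEMM.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) codegen needs compute
        # capability >= (8, 9); older parts only get fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    class SiluMulQuantGuardedTest(unittest.TestCase):
        def test_capability_gate(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))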
Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body identical to the first listing above ...]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body and failure identical to the first listing above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
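Each "Trying example:" block above is Hypothesis replaying the test with one tuple drawn from the @given strategies shown in the first listing. A stripped-down sketch of the same pattern; _MAX_SAMPLES is assumed to be a module-level constant, as in the real activation_test.py:

    # Sketch: how the @given/@settings pair drives the example stream.
    # _MAX_SAMPLES is an assumption; the real constant lives in the test module.
    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    _MAX_SAMPLES = 25

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_parameter_grid(T: int, D: int, scale_ub) -> None:
        # Verbose mode prints each drawn tuple as a "Trying example:" line.
        assert (T, D) != (0, 0)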
[... the next eleven Hypothesis examples all raised torch.OutOfMemoryError before reaching the kernel; test body identical to the first listing above. Every message reports the same allocator state (GPU 0: 22.07 GiB total capacity, 30.44 MiB free, 22.03 GiB in use including non-PyTorch memory; 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated) and ends with the same advice: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)". Only the parameters, the requested allocation, and the failing test line differ: ...]

    T      D     scale_ub  contiguous  compiled  tried to allocate  failing line
    2048   5120  None      True        False      40.00 MiB         moe/activation_test.py:94 (x_sign = torch.sign(x))
    16384  5120  None      True        False     320.00 MiB         moe/activation_test.py:92 (torch.randn)
    4096   5120  None      True        False      80.00 MiB         moe/activation_test.py:92
    2048   5120  None      False       False      40.00 MiB         moe/activation_test.py:92
    4096   7168  None      True        True      112.00 MiB         moe/activation_test.py:92
    2048   5120  1200.0    False       False      40.00 MiB         moe/activation_test.py:92
    4096   7168  1200.0    True        False     112.00 MiB         moe/activation_test.py:92
    16384  7168  None      False       True      448.00 MiB         moe/activation_test.py:92
    4096   7168  None      True        False     112.00 MiB         moe/activation_test.py:92
    16384  7168  None      True        False     448.00 MiB         moe/activation_test.py:92
    16384  7168  1200.0    True        False     448.00 MiB         moe/activation_test.py:92
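Note the pattern in the streak above: free memory stays pinned at 30.44 MiB while requests as small as 40 MiB fail, so memory from earlier examples is evidently accumulating in the process rather than being released between runs. A sketch of the two mitigations the error text itself points toward; release_cuda_memory is an illustrative helper name, and the allocator variable must be set before CUDA initializes:

    # Sketch: the allocator option suggested by the error text must be set
    # before the first CUDA allocation, e.g. in the shell that starts pytest:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ...
    import gc
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # import after the env var so the allocator sees it

    def release_cuda_memory() -> None:
        # Illustrative teardown between Hypothesis examples: drop Python
        # references, then return cached blocks to the CUDA driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()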
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
[... test body and failure identical to the first listing above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False
[... test body identical to the first listing above ...]
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
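For debugging outside the Hypothesis loop, the failing call can be reproduced directly. A minimal sketch; the import path is inferred from the traceback (fbgemm_gpu/experimental/gen_ai/moe/activation.py) and should be treated as an assumption:

    # Sketch: standalone repro of the failing silu_mul_quant call.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a pre-sm_89 GPU this raises the same fp8e4nv CompilationError seen
    # throughout this log; on sm_89+ it returns the fp8 tensor and its scale.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)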
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
[... test body identical to the first listing above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching silu_mul_quant and the same Triton compile path ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[... three final examples, test body identical to the first listing above; each raised torch.OutOfMemoryError trying to allocate 20.00 MiB with only 8.44 MiB of 22.07 GiB free (about 21.77 GiB allocated by PyTorch) ...]

    T    D     scale_ub  contiguous  compiled  failing line
    128  7168  1200.0    True        False     moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0))
    128  5120  1200.0    True        True      moe/activation_test.py:94 (x_sign = torch.sign(x))
    128  7168  None      True        True      moe/activation_test.py:92 (torch.randn)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 (repeated 3 times)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated.
See " 2025-05-07T20:31:45.4938873Z 2025-05-07T20:31:45.4939056Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:31:45.4940322Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:31:45.4940515Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:31:45.4940520Z 2025-05-07T20:31:45.4940726Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:31:45.4940888Z ================== 1 failed, 1 passed, 13 warnings in 21.90s =================== 2025-05-07T20:31:47.2200803Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:31:47.2824137Z 2025-05-07T20:31:47.2825516Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:31:47.2825890Z 2025-05-07T20:31:47.2825896Z 2025-05-07T20:31:47.2845929Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:31:49.4337482Z ============================= test session starts ============================== 2025-05-07T20:31:49.4338138Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:49.4338669Z cachedir: .pytest_cache 2025-05-07T20:31:49.4339243Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:49.4340321Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:49.4340729Z plugins: hypothesis-6.131.14 2025-05-07T20:31:51.0409445Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:51.1936522Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:31:51.1937322Z run-last-failure: rerun previous 1 failure 2025-05-07T20:31:51.1937538Z 2025-05-07T20:31:53.3077015Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.3078812Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.3080972Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.3083394Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.3085020Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3086982Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.3089296Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.3090905Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3092906Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.3095101Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.3096814Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3098911Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.3101420Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.3103512Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.3105627Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.3107071Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3108911Z 
W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.3111080Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.3112487Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.3114613Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.3116883Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.3118906Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.3120726Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.3122703Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.3124824Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.3126537Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.3127976Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.3129507Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.3131143Z W0507 20:31:53.305000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
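[Editor's note] This ValueError, repeated for the rest of the run, is the real failure: fp8e4nv is Triton's name for float8_e4m3fn, whose conversions require compute capability sm_89 (Ada) or newer, while the linux.g5.4xlarge runner carries an A10G at sm_86 — hence only fp8e4b15 and fp8e5 are on offer. A hedged probe using only public PyTorch API:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (a.k.a. float8_e4m3fn) needs compute capability >= (8, 9):
        # Ada sm_89 or Hopper sm_90. The A10G on this runner is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)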
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.3248748Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.3250537Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:31:53.3252706Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.3255110Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.3257053Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3259299Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.3261600Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.3263041Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3265151Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.3267469Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.3269361Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3271504Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.3273592Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:31:53.3275709Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.3277782Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:31:53.3279220Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.3280907Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:53.3282643Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:31:53.3284042Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:31:53.3286119Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.3288426Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.3290442Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:53.3292254Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:31:53.3294611Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.3297041Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.3298931Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.3300568Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.3301838Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:31:53.3303836Z W0507 20:31:53.323000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
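[Editor's note] The test listing that follows spells out the reference path in full; extracted on its own, the math ref_fn checks against is just an fp32 SiLU-mul, as in this sketch taken directly from that source:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # fp32 reference from ref_fn in the test below: SiLU(x0) * x1.
        x0 = x0.to(torch.float32)
        x1 = x1.to(torch.float32)
        return x0 * torch.sigmoid(x0) * x1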
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.8405243Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.8405943Z self=, 2025-05-07T20:31:53.8406360Z T=1, 2025-05-07T20:31:53.8406575Z D=5120, 2025-05-07T20:31:53.8406776Z scale_ub=None, 2025-05-07T20:31:53.8407011Z contiguous=True, 2025-05-07T20:31:53.8407247Z compiled=True, 2025-05-07T20:31:53.8407460Z ) 2025-05-07T20:31:53.8407794Z self = 2025-05-07T20:31:53.8408286Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.8408569Z 2025-05-07T20:31:53.8408654Z @given( 2025-05-07T20:31:53.8408900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.8409229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.8409537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.8409885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.8410224Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.8410516Z ) 2025-05-07T20:31:53.8410869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.8411317Z def test_silu_mul_quant( 2025-05-07T20:31:53.8411573Z self, 2025-05-07T20:31:53.8411773Z T: int, 2025-05-07T20:31:53.8411985Z D: int, 2025-05-07T20:31:53.8412214Z scale_ub: Optional[float], 2025-05-07T20:31:53.8412488Z contiguous: bool, 2025-05-07T20:31:53.8412742Z compiled: bool, 2025-05-07T20:31:53.8412983Z ) -> None: 2025-05-07T20:31:53.8413206Z torch.manual_seed(2025) 2025-05-07T20:31:53.8413455Z 2025-05-07T20:31:53.8413737Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.8414080Z 2025-05-07T20:31:53.8414284Z x_sign = torch.sign(x) 2025-05-07T20:31:53.8414586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.8414897Z x = x_sign * x_clamp 2025-05-07T20:31:53.8415157Z x0 = x[:, :D] 2025-05-07T20:31:53.8415373Z x1 = x[:, D:] 2025-05-07T20:31:53.8415587Z 2025-05-07T20:31:53.8415780Z if contiguous: 2025-05-07T20:31:53.8416011Z x0 = x0.contiguous() 2025-05-07T20:31:53.8416274Z x1 = x1.contiguous() 2025-05-07T20:31:53.8416522Z 2025-05-07T20:31:53.8416715Z if scale_ub is not None: 2025-05-07T20:31:53.8417000Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.8417340Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.8417646Z ) 2025-05-07T20:31:53.8417857Z else: 2025-05-07T20:31:53.8418078Z scale_ub_tensor = None 2025-05-07T20:31:53.8418331Z 2025-05-07T20:31:53.8418568Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.8418888Z op = silu_mul_quant 2025-05-07T20:31:53.8419441Z if compiled: 2025-05-07T20:31:53.8419699Z op = torch.compile(op) 2025-05-07T20:31:53.8420007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.8420284Z 2025-05-07T20:31:53.8420476Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.8420762Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.8421054Z 2025-05-07T20:31:53.8421289Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.8421625Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.8421924Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.8422236Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.8422804Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.8423119Z 2025-05-07T20:31:53.8423319Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.8423521Z 2025-05-07T20:31:53.8423624Z moe/activation_test.py:126: 2025-05-07T20:31:53.8423928Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8424270Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.8424590Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.8425380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.8426134Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.8426673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.8427360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.8428057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.8429226Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.8429981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:53.8430727Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.8431450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.8432091Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.8432685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.8433208Z fn() 2025-05-07T20:31:53.8433722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.8434316Z self.fn.run( 2025-05-07T20:31:53.8434780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.8435318Z kernel = self.compile( 2025-05-07T20:31:53.8435864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.8436511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.8436914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.8437142Z 2025-05-07T20:31:53.8437358Z self = 2025-05-07T20:31:53.8438443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.8439831Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4986e7ae80>} 2025-05-07T20:31:53.8441317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.8442344Z context = 2025-05-07T20:31:53.8442631Z 2025-05-07T20:31:53.8442805Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.8443316Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.8443785Z module_map=module_map) 2025-05-07T20:31:53.8444157Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.8444633Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.8444899Z E ^ 2025-05-07T20:31:53.8445363Z E ValueError("type fp8e4nv not supported in this architecture. 
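[Editor's note] Note that ref_fn fails the same way: triton_quantize_fp8_row JIT-compiles its own kernel (_kernel_quantize_fp8_row) and trips over the same dtype. For intuition only, a pure-PyTorch stand-in — an assumption about the rowwise scheme, not FBGEMM's kernel — that honors the dequantization contract the test uses (y ≈ y_fp8.float() * y_scale[:, None]):

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # apply the upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max       # one scale per row
        y_scaled = (y / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_scaled.to(torch.float8_e4m3fn), scale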
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.8445823Z 2025-05-07T20:31:53.8446247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.8446754Z 2025-05-07T20:31:53.8446868Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.8447278Z self=, 2025-05-07T20:31:53.8447702Z T=2048, 2025-05-07T20:31:53.8447901Z D=5120, 2025-05-07T20:31:53.8448104Z scale_ub=1200.0, 2025-05-07T20:31:53.8448328Z contiguous=True, 2025-05-07T20:31:53.8448557Z compiled=False, 2025-05-07T20:31:53.8448780Z ) 2025-05-07T20:31:54.3751617Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.3752750Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.3754100Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.3755532Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.3756504Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3757805Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.3759184Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.3760158Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3761381Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.3762748Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.3764153Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3765431Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.3766669Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.3767890Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.3769252Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.3770089Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.3771111Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.3772121Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.3772913Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.3774121Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.3775404Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.3776525Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.3777562Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.3778782Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.3780145Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.3781208Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.3782114Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.3782857Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.3783874Z W0507 20:31:54.372000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.4846759Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.4847875Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:31:54.4850631Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.4853468Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.4855419Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4858006Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.4859909Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.4860890Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4862112Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.4863486Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.4864555Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4865842Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.4867093Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:31:54.4868310Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.4869637Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:31:54.4870472Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.4871502Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:54.4872524Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:31:54.4873313Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:31:54.4874523Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.4875800Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.4877004Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:54.4878039Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:31:54.4879217Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.4880578Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.4881718Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.4882641Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.4883379Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:31:54.4884402Z W0507 20:31:54.482000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
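[Editor's note] The W0507 records wrapping each failure come from torch.compile: Dynamo generates TTIR for the user-defined Triton kernel to work out which inputs it mutates, hits the same ValueError, and conservatively assumes every input is mutated — the warning is a symptom, not a separate bug. One blunt pattern a harness might use to keep eager coverage when the compiled path cannot build (a sketch, not what activation_test.py does — and here even eager fails, since the kernel itself targets fp8e4nv):

    import torch

    def run_with_eager_fallback(op, *args):
        try:
            return torch.compile(op)(*args)   # compile errors surface at call time
        except Exception:
            return op(*args)                  # eager path as a last resort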
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9320179Z self = 2025-05-07T20:31:54.9320686Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:54.9320966Z 2025-05-07T20:31:54.9321061Z @given( 2025-05-07T20:31:54.9321330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.9321756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.9322145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.9322481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.9322822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.9323108Z ) 2025-05-07T20:31:54.9323466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.9323914Z def test_silu_mul_quant( 2025-05-07T20:31:54.9324154Z self, 2025-05-07T20:31:54.9324358Z T: int, 2025-05-07T20:31:54.9324566Z D: int, 2025-05-07T20:31:54.9324785Z scale_ub: Optional[float], 2025-05-07T20:31:54.9325067Z contiguous: bool, 2025-05-07T20:31:54.9325316Z compiled: bool, 2025-05-07T20:31:54.9325544Z ) -> None: 2025-05-07T20:31:54.9325766Z torch.manual_seed(2025) 2025-05-07T20:31:54.9326015Z 2025-05-07T20:31:54.9326294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.9326644Z 2025-05-07T20:31:54.9326843Z x_sign = torch.sign(x) 2025-05-07T20:31:54.9327133Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.9327448Z x = x_sign * x_clamp 2025-05-07T20:31:54.9327697Z x0 = x[:, :D] 2025-05-07T20:31:54.9327920Z x1 = x[:, D:] 2025-05-07T20:31:54.9328403Z 2025-05-07T20:31:54.9328600Z if contiguous: 2025-05-07T20:31:54.9328839Z x0 = x0.contiguous() 2025-05-07T20:31:54.9329090Z x1 = x1.contiguous() 2025-05-07T20:31:54.9329333Z 2025-05-07T20:31:54.9329528Z if scale_ub is not None: 2025-05-07T20:31:54.9329795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.9330131Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.9330444Z ) 2025-05-07T20:31:54.9330638Z else: 2025-05-07T20:31:54.9330855Z scale_ub_tensor = None 2025-05-07T20:31:54.9331111Z 2025-05-07T20:31:54.9331343Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9331661Z op = silu_mul_quant 2025-05-07T20:31:54.9331918Z if compiled: 2025-05-07T20:31:54.9332327Z op = torch.compile(op) 2025-05-07T20:31:54.9332632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9332910Z 2025-05-07T20:31:54.9333106Z > y_fp8, y_scale = fn() 2025-05-07T20:31:54.9333272Z 2025-05-07T20:31:54.9333373Z moe/activation_test.py:117: 2025-05-07T20:31:54.9333669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9334006Z moe/activation_test.py:115: in fn 2025-05-07T20:31:54.9334285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9334983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:54.9335677Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:54.9336332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.9337005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.9337670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.9338205Z kernel = self.compile( 2025-05-07T20:31:54.9338744Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.9339405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9339807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9340040Z 2025-05-07T20:31:54.9340253Z self = 2025-05-07T20:31:54.9341324Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.9342698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49872f9da0>} 2025-05-07T20:31:54.9344029Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.9345048Z context = 2025-05-07T20:31:54.9345335Z 2025-05-07T20:31:54.9345510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.9346020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9346490Z module_map=module_map) 2025-05-07T20:31:54.9346864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9347212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.9347474Z E ^ 2025-05-07T20:31:54.9347946Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9348393Z 2025-05-07T20:31:54.9348833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.9349446Z 2025-05-07T20:31:54.9349551Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.9349968Z self=, 2025-05-07T20:31:54.9350375Z T=2048, 2025-05-07T20:31:54.9350563Z D=5120, 2025-05-07T20:31:54.9350760Z scale_ub=1200.0, 2025-05-07T20:31:54.9350993Z contiguous=True, 2025-05-07T20:31:54.9351215Z compiled=True, 2025-05-07T20:31:54.9351432Z ) 2025-05-07T20:31:54.9351756Z self = 2025-05-07T20:31:54.9352246Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:54.9352513Z 2025-05-07T20:31:54.9352594Z @given( 2025-05-07T20:31:54.9352920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:54.9353238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:54.9353543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:54.9353874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:54.9354201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:54.9354481Z ) 2025-05-07T20:31:54.9354832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:54.9355271Z def test_silu_mul_quant( 2025-05-07T20:31:54.9355510Z self, 2025-05-07T20:31:54.9355710Z T: int, 2025-05-07T20:31:54.9355917Z D: int, 2025-05-07T20:31:54.9356216Z scale_ub: Optional[float], 2025-05-07T20:31:54.9356486Z contiguous: bool, 2025-05-07T20:31:54.9356729Z compiled: bool, 2025-05-07T20:31:54.9356958Z ) -> None: 2025-05-07T20:31:54.9357172Z torch.manual_seed(2025) 2025-05-07T20:31:54.9357423Z 2025-05-07T20:31:54.9357704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:54.9358042Z 2025-05-07T20:31:54.9358244Z x_sign = torch.sign(x) 2025-05-07T20:31:54.9358540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:54.9358845Z x = x_sign * x_clamp 2025-05-07T20:31:54.9359088Z x0 = x[:, :D] 
2025-05-07T20:31:54.9359316Z x1 = x[:, D:] 2025-05-07T20:31:54.9359522Z 2025-05-07T20:31:54.9359713Z if contiguous: 2025-05-07T20:31:54.9359951Z x0 = x0.contiguous() 2025-05-07T20:31:54.9360207Z x1 = x1.contiguous() 2025-05-07T20:31:54.9360449Z 2025-05-07T20:31:54.9360644Z if scale_ub is not None: 2025-05-07T20:31:54.9367031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:54.9367442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:54.9367749Z ) 2025-05-07T20:31:54.9367956Z else: 2025-05-07T20:31:54.9368183Z scale_ub_tensor = None 2025-05-07T20:31:54.9368440Z 2025-05-07T20:31:54.9368720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9369071Z op = silu_mul_quant 2025-05-07T20:31:54.9369323Z if compiled: 2025-05-07T20:31:54.9369584Z op = torch.compile(op) 2025-05-07T20:31:54.9369886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:54.9370159Z 2025-05-07T20:31:54.9370368Z y_fp8, y_scale = fn() 2025-05-07T20:31:54.9370661Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:54.9370961Z 2025-05-07T20:31:54.9371202Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:54.9371550Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:54.9371852Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:54.9372168Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:54.9372546Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.9372867Z 2025-05-07T20:31:54.9373067Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:54.9373272Z 2025-05-07T20:31:54.9373376Z moe/activation_test.py:126: 2025-05-07T20:31:54.9373682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9374029Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:54.9374354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:54.9375154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:54.9375919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:54.9376471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:54.9377161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:54.9377990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:54.9378745Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.9379534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:54.9380291Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:54.9381029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:54.9381683Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:54.9382363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:54.9382892Z fn() 2025-05-07T20:31:54.9383416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:54.9384000Z self.fn.run( 2025-05-07T20:31:54.9384471Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:54.9385012Z kernel = self.compile( 2025-05-07T20:31:54.9385564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:54.9386218Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9386623Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:54.9386855Z 2025-05-07T20:31:54.9387074Z self = 2025-05-07T20:31:54.9388183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:54.9389668Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985c122a0>} 2025-05-07T20:31:54.9391025Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:54.9392066Z context = 2025-05-07T20:31:54.9392358Z 2025-05-07T20:31:54.9392539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:54.9393063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9393543Z module_map=module_map) 2025-05-07T20:31:54.9393915Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9394281Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:54.9394548Z E ^ 2025-05-07T20:31:54.9395023Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9395478Z 2025-05-07T20:31:54.9395910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:54.9396427Z 2025-05-07T20:31:54.9396533Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:54.9396956Z self=, 2025-05-07T20:31:54.9397368Z T=16384, 2025-05-07T20:31:54.9397570Z D=7168, 2025-05-07T20:31:54.9397771Z scale_ub=1200.0, 2025-05-07T20:31:54.9398011Z contiguous=False, 2025-05-07T20:31:54.9398250Z compiled=False, 2025-05-07T20:31:54.9398459Z ) 2025-05-07T20:31:55.2439846Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.2440938Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.2442285Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.2443706Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.2444794Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2446108Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 
2025-05-07T20:31:55.2447490Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2448481Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2449705Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.2451091Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2452163Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2453444Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.2454697Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.2455923Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.2457138Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.2457974Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.2459005Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.2460027Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.2460819Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.2462122Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.2463409Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.2464535Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.2465579Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.2466755Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.2468197Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.2469328Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2470250Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.2470995Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.2472012Z W0507 20:31:55.241000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.3190646Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.3191733Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:31:55.3193074Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.3194493Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.3195475Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3196788Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.3198165Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.3199192Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3200432Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.3201812Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.3203037Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3204316Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.3205572Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:31:55.3206795Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.3208124Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:31:55.3208962Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.3209992Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:55.3211011Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:31:55.3211802Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:31:55.3213016Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.3214309Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.3215425Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:55.3216467Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:31:55.3217649Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.3219061Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.3220127Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.3221048Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.3221795Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:31:55.3222816Z W0507 20:31:55.316000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.9918345Z self = 2025-05-07T20:31:55.9919281Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:55.9919621Z 2025-05-07T20:31:55.9919702Z @given( 2025-05-07T20:31:55.9919946Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.9920425Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.9920739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.9921076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.9921399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.9921687Z ) 2025-05-07T20:31:55.9922042Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.9922488Z def test_silu_mul_quant( 2025-05-07T20:31:55.9922731Z self, 2025-05-07T20:31:55.9922932Z T: int, 2025-05-07T20:31:55.9923138Z D: int, 2025-05-07T20:31:55.9923355Z scale_ub: Optional[float], 2025-05-07T20:31:55.9923633Z contiguous: bool, 2025-05-07T20:31:55.9924049Z compiled: bool, 2025-05-07T20:31:55.9924273Z ) -> None: 2025-05-07T20:31:55.9924500Z torch.manual_seed(2025) 2025-05-07T20:31:55.9924752Z 2025-05-07T20:31:55.9925037Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.9925387Z 2025-05-07T20:31:55.9925587Z x_sign = torch.sign(x) 2025-05-07T20:31:55.9925877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.9926201Z x = x_sign * x_clamp 2025-05-07T20:31:55.9926452Z x0 = x[:, :D] 2025-05-07T20:31:55.9926669Z x1 = x[:, D:] 2025-05-07T20:31:55.9926883Z 2025-05-07T20:31:55.9927073Z if contiguous: 2025-05-07T20:31:55.9927301Z x0 = x0.contiguous() 2025-05-07T20:31:55.9927561Z x1 = x1.contiguous() 2025-05-07T20:31:55.9927807Z 2025-05-07T20:31:55.9928003Z if scale_ub is not None: 2025-05-07T20:31:55.9928454Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.9928799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.9929110Z ) 2025-05-07T20:31:55.9929302Z else: 2025-05-07T20:31:55.9929515Z scale_ub_tensor = None 2025-05-07T20:31:55.9929770Z 2025-05-07T20:31:55.9930005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.9930323Z op = silu_mul_quant 2025-05-07T20:31:55.9930577Z if compiled: 2025-05-07T20:31:55.9930825Z op = torch.compile(op) 2025-05-07T20:31:55.9931125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.9931403Z 2025-05-07T20:31:55.9931593Z > y_fp8, y_scale = fn() 2025-05-07T20:31:55.9931762Z 2025-05-07T20:31:55.9931864Z moe/activation_test.py:117: 2025-05-07T20:31:55.9932163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.9932496Z moe/activation_test.py:115: in fn 2025-05-07T20:31:55.9932774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.9933466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:55.9934161Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:55.9934708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.9935390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.9936049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.9936584Z kernel = self.compile( 2025-05-07T20:31:55.9937125Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.9937784Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.9938179Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.9938422Z 2025-05-07T20:31:55.9938628Z self = 2025-05-07T20:31:55.9939877Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.9941242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49857a2700>} 2025-05-07T20:31:55.9942574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.9943594Z context = 2025-05-07T20:31:55.9943886Z 2025-05-07T20:31:55.9944162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.9944681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.9945144Z module_map=module_map) 2025-05-07T20:31:55.9945520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.9945881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.9946135Z E ^ 2025-05-07T20:31:55.9946600Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.9947056Z 2025-05-07T20:31:55.9947470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.9947978Z 2025-05-07T20:31:55.9948091Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.9948498Z self=, 2025-05-07T20:31:55.9948908Z T=1, 2025-05-07T20:31:55.9949201Z D=7168, 2025-05-07T20:31:55.9949424Z scale_ub=None, 2025-05-07T20:31:55.9949642Z contiguous=True, 2025-05-07T20:31:55.9949874Z compiled=True, 2025-05-07T20:31:55.9950082Z ) 2025-05-07T20:31:55.9950406Z self = 2025-05-07T20:31:55.9950887Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.9951144Z 2025-05-07T20:31:55.9951231Z @given( 2025-05-07T20:31:55.9951464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.9951786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.9952094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.9952422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.9952753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.9953042Z ) 2025-05-07T20:31:55.9953393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.9953830Z def test_silu_mul_quant( 2025-05-07T20:31:55.9954073Z self, 2025-05-07T20:31:55.9954269Z T: int, 2025-05-07T20:31:55.9954464Z D: int, 2025-05-07T20:31:55.9954682Z scale_ub: Optional[float], 2025-05-07T20:31:55.9954959Z contiguous: bool, 2025-05-07T20:31:55.9955193Z compiled: bool, 2025-05-07T20:31:55.9955416Z ) -> None: 2025-05-07T20:31:55.9955634Z torch.manual_seed(2025) 2025-05-07T20:31:55.9955882Z 2025-05-07T20:31:55.9956161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.9956500Z 2025-05-07T20:31:55.9956691Z x_sign = torch.sign(x) 2025-05-07T20:31:55.9956985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.9957290Z x = x_sign * x_clamp 2025-05-07T20:31:55.9957527Z x0 = x[:, :D] 2025-05-07T20:31:55.9957755Z x1 = 
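Every failure in this run has the same root cause: Triton's fp8e4nv is the FP8 E4M3 format, whose NVIDIA codegen requires compute capability 8.9+ (Ada/Hopper), while the linux.g5.4xlarge runner carries an A10G at SM 8.6; on that architecture Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError reports. A guard of roughly the following shape would skip the test on such runners; the helper name and placement are illustrative, not FBGEMM's actual mechanism:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True if this GPU can compile Triton kernels that use fp8e4nv (E4M3)."""
        if not torch.cuda.is_available():
            return False
        # E4M3 codegen needs SM 8.9+; the A10G on g5 instances is SM 8.6.
        return torch.cuda.get_device_capability() >= (8, 9)


    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # hypothesis-parametrized body as shown in the log above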
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [test body identical to the first failing example above]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:56.3578284Z W0507 20:31:56.355000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical: CompilationError at 1:0 in _fbgemm_silu_mul_quant, fp8e4nv unsupported]
2025-05-07T20:31:56.6217456Z W0507 20:31:56.619000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[traceback identical: CompilationError at 1:0 in _fbgemm_silu_mul_quant, fp8e4nv unsupported]
2025-05-07T20:31:57.2962523Z self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
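For debugging outside pytest/hypothesis, the failing eager path reproduces in a few lines. The module path comes from the traceback above (fbgemm_gpu/experimental/gen_ai/moe/activation.py) and the call signature from the test body; treat both as assumptions about this particular genai build rather than a documented API:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)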
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining Triton compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:31:57.3463984Z self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body identical to the first failing example above; y_fp8, y_scale = fn() returns, so the failure is raised from the reference path]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
[autotuner and Triton compile chain identical to the T=1 reference-path example above]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
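The reference path fails the same way because triton_quantize_fp8_row is itself a Triton kernel targeting the E4M3 output type. What it computes is ordinary row-wise quantization, as the test's dequantization step y_fp8.to(torch.float32) * y_scale[:, None] implies. A pure-PyTorch sketch of that computation, with the exact clamping and epsilon handling of _kernel_quantize_fp8_row assumed rather than known:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # One scale per row, chosen so the row's max |value| lands at the FP8 limit.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale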
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
2025-05-07T20:31:57.6536934Z self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

[eager _fbgemm_silu_mul_quant compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test body identical to the first failing example above]

>       y_fp8, y_scale = fn()

[eager _fbgemm_silu_mul_quant compile chain identical to the first example]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
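The two FP8 formats the error does list for this architecture are fp8e4b15 and fp8e5; fp8e5 is E5M2, which PyTorch exposes as torch.float8_e5m2. A quick round-trip check of that dtype (a plain cast with no Triton codegen involved, so it should work on the A10G; that expectation is an assumption, not something this log verifies):

    import torch

    x = torch.randn(4, 8, device="cuda", dtype=torch.bfloat16)
    x_e5m2 = x.to(torch.float8_e5m2)  # E5M2: wider range, less precision than E4M3
    err = (x.to(torch.float32) - x_e5m2.to(torch.float32)).abs().max()
    print(f"max round-trip error: {err.item():.4f}")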
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:57.9859639Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:31:57.9860520Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:57.9861546Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:57.9862556Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:31:57.9863349Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:31:57.9864557Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:57.9865831Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:57.9866943Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:57.9868110Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:31:57.9869367Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:57.9870717Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:57.9871777Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.9872685Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.9873498Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:31:57.9874521Z W0507 20:31:57.981000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.0691213Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:31:58.0720633Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:58.0721540Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:31:58.0722284Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
2025-05-07T20:31:58.0723302Z W0507 20:31:58.066000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.3675451Z self = 2025-05-07T20:31:58.3675946Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:58.3676292Z 2025-05-07T20:31:58.3676408Z @given( 2025-05-07T20:31:58.3676737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.3677053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.3677363Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.3677697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.3678027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.3678324Z ) 2025-05-07T20:31:58.3678684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.3679128Z def test_silu_mul_quant( 2025-05-07T20:31:58.3679366Z self, 2025-05-07T20:31:58.3679564Z T: int, 2025-05-07T20:31:58.3679936Z D: int, 2025-05-07T20:31:58.3680156Z scale_ub: Optional[float], 2025-05-07T20:31:58.3680427Z contiguous: bool, 2025-05-07T20:31:58.3680672Z compiled: bool, 2025-05-07T20:31:58.3680896Z ) -> None: 2025-05-07T20:31:58.3681115Z torch.manual_seed(2025) 2025-05-07T20:31:58.3681355Z 2025-05-07T20:31:58.3681620Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.3681966Z 2025-05-07T20:31:58.3682165Z x_sign = torch.sign(x) 2025-05-07T20:31:58.3682459Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.3682765Z x = x_sign * x_clamp 2025-05-07T20:31:58.3683014Z x0 = x[:, :D] 2025-05-07T20:31:58.3683382Z x1 = x[:, D:] 2025-05-07T20:31:58.3683582Z 2025-05-07T20:31:58.3683770Z if contiguous: 2025-05-07T20:31:58.3684006Z x0 = x0.contiguous() 2025-05-07T20:31:58.3684263Z x1 = x1.contiguous() 2025-05-07T20:31:58.3684514Z 2025-05-07T20:31:58.3684709Z if scale_ub is not None: 2025-05-07T20:31:58.3684977Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.3685315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.3685627Z ) 2025-05-07T20:31:58.3685815Z else: 2025-05-07T20:31:58.3686034Z scale_ub_tensor = None 2025-05-07T20:31:58.3686287Z 2025-05-07T20:31:58.3686518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3686842Z op = silu_mul_quant 2025-05-07T20:31:58.3687096Z if compiled: 2025-05-07T20:31:58.3687348Z op = torch.compile(op) 2025-05-07T20:31:58.3687639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.3687917Z 2025-05-07T20:31:58.3688112Z y_fp8, y_scale = fn() 2025-05-07T20:31:58.3688394Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:58.3688682Z 2025-05-07T20:31:58.3688927Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.3689256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:58.3689548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:58.3689862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:58.3690234Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.3690577Z 2025-05-07T20:31:58.3690779Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:58.3690978Z 2025-05-07T20:31:58.3691079Z moe/activation_test.py:126: 2025-05-07T20:31:58.3691379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3691714Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:58.3692040Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:58.3692825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:31:58.3693582Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:58.3694122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.3694795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.3695475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:58.3696194Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.3696937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:58.3697684Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:58.3698408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:58.3699133Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:58.3699727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:58.3700268Z fn() 2025-05-07T20:31:58.3700796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:58.3701375Z self.fn.run( 2025-05-07T20:31:58.3701838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.3702368Z kernel = self.compile( 2025-05-07T20:31:58.3702906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.3703628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.3704023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.3704255Z 2025-05-07T20:31:58.3704467Z self = 2025-05-07T20:31:58.3705541Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.3706891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985498fe0>} 2025-05-07T20:31:58.3708224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.3709308Z context = 2025-05-07T20:31:58.3709594Z 2025-05-07T20:31:58.3709766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.3710282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.3710740Z module_map=module_map) 2025-05-07T20:31:58.3711105Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.3711459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:58.3711719Z E ^ 2025-05-07T20:31:58.3712185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.3713050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
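Every failure in this run bottoms out in the same Triton error: fp8e4nv is Triton's name for the e4m3 FP8 format with NVIDIA semantics, and its hardware conversions require CUDA compute capability 8.9 or newer (Ada/Hopper). On an SM 8.6 part such as the A10G in AWS g5 instances, the CUDA backend only exposes fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard sketch (a hypothetical helper, not part of the FBGEMM test suite) that would skip these tests on unsupported GPUs:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton lowers tl.float8e4nv (e4m3) only on CUDA compute capability
    # >= (8, 9); older GPUs are limited to fp8e4b15 / fp8e5.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: decorate FP8 test cases so they skip instead of erroring.
requires_fp8e4nv = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv (e4m3) unsupported on this GPU"
)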
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.6855037Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:31:58.6855864Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.6856893Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:58.6857917Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:31:58.6858724Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:58.6859930Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.6861258Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.6862374Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.6863421Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:31:58.6864681Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.6866031Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.6867090Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.6868003Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.6868750Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:31:58.6869907Z W0507 20:31:58.681000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.7677021Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:58.7679112Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Traceback (most recent call last): 2025-05-07T20:31:58.7680842Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:58.7682259Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:58.7683239Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7684529Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:58.7685906Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.7686884Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7688114Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:58.7689490Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.7690599Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7691881Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:58.7693136Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] generator.visit(fn.parse()) 2025-05-07T20:31:58.7694533Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:58.7695746Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ret = super().visit(node) 2025-05-07T20:31:58.7696574Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:58.7697604Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:58.7698622Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] return visitor(node) 2025-05-07T20:31:58.7699540Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^^^^^^^^^^^^^ 2025-05-07T20:31:58.7700807Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:58.7702076Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:58.7703193Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:58.7704231Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] self.visit(item) 2025-05-07T20:31:58.7705420Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:58.7706777Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:58.7707837Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.7708753Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.7709564Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:31:58.7710599Z W0507 20:31:58.765000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.0660044Z self = 2025-05-07T20:31:59.0660975Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:59.0661373Z 2025-05-07T20:31:59.0661509Z @given( 2025-05-07T20:31:59.0661863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.0662316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.0662749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.0663104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.0663442Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.0663736Z ) 2025-05-07T20:31:59.0664092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.0664548Z def test_silu_mul_quant( 2025-05-07T20:31:59.0664826Z self, 2025-05-07T20:31:59.0665033Z T: int, 2025-05-07T20:31:59.0665237Z D: int, 2025-05-07T20:31:59.0665454Z scale_ub: Optional[float], 2025-05-07T20:31:59.0665934Z contiguous: bool, 2025-05-07T20:31:59.0666189Z compiled: bool, 2025-05-07T20:31:59.0666414Z ) -> None: 2025-05-07T20:31:59.0666646Z torch.manual_seed(2025) 2025-05-07T20:31:59.0666892Z 2025-05-07T20:31:59.0667172Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.0667526Z 2025-05-07T20:31:59.0667725Z x_sign = torch.sign(x) 2025-05-07T20:31:59.0668015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.0668331Z x = x_sign * x_clamp 2025-05-07T20:31:59.0668577Z x0 = x[:, :D] 2025-05-07T20:31:59.0668792Z x1 = x[:, D:] 2025-05-07T20:31:59.0669004Z 2025-05-07T20:31:59.0669406Z if contiguous: 2025-05-07T20:31:59.0669646Z x0 = x0.contiguous() 2025-05-07T20:31:59.0669911Z x1 = x1.contiguous() 2025-05-07T20:31:59.0670153Z 2025-05-07T20:31:59.0670380Z if scale_ub is not None: 2025-05-07T20:31:59.0670688Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.0671025Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.0671337Z ) 2025-05-07T20:31:59.0671533Z else: 2025-05-07T20:31:59.0671748Z scale_ub_tensor = None 2025-05-07T20:31:59.0672001Z 2025-05-07T20:31:59.0672234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0672555Z op = silu_mul_quant 2025-05-07T20:31:59.0672811Z if compiled: 2025-05-07T20:31:59.0673062Z op = torch.compile(op) 2025-05-07T20:31:59.0673364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.0673642Z 2025-05-07T20:31:59.0673834Z y_fp8, y_scale = fn() 2025-05-07T20:31:59.0674139Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:59.0674435Z 2025-05-07T20:31:59.0674671Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.0675008Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:59.0675310Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:59.0675630Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:59.0675988Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0676304Z 2025-05-07T20:31:59.0676509Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:59.0676704Z 2025-05-07T20:31:59.0676807Z moe/activation_test.py:126: 2025-05-07T20:31:59.0677107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0677448Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:59.0677771Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:59.0678564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:31:59.0679332Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:59.0679886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.0680570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.0681260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:59.0681983Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0682741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:31:59.0683484Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:59.0684218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:59.0684867Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:59.0685685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:59.0686213Z fn() 2025-05-07T20:31:59.0686727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:59.0687314Z self.fn.run( 2025-05-07T20:31:59.0687779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.0688319Z kernel = self.compile( 2025-05-07T20:31:59.0688865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.0694817Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.0695347Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.0695593Z 2025-05-07T20:31:59.0695805Z self = 2025-05-07T20:31:59.0697199Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.0698811Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4984c9d8a0>} 2025-05-07T20:31:59.0700172Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.0701259Z context = 2025-05-07T20:31:59.0701554Z 2025-05-07T20:31:59.0701730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.0702260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.0702732Z module_map=module_map) 2025-05-07T20:31:59.0703102Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.0703459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:59.0703730Z E ^ 2025-05-07T20:31:59.0704203Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:59.0705092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
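The reference path that keeps failing, triton_quantize_fp8_row, is row-wise FP8 quantization: each row of y is scaled by its max-abs value to fit the fp8 range, and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so the returned per-row scale is the dequantization scale. A pure-PyTorch sketch of that scheme (an illustrative analogue under assumed clamping and epsilon details, not FBGEMM's kernel):

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs, optionally bounded by scale_ub as in the test's
    # scale_ub_tensor argument.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = torch.clamp(row_max, min=1e-12)  # assumed guard against /0
    scale = row_max / FP8_MAX  # dequant: y ~= y_fp8.float() * scale[:, None]
    y_q = (y.to(torch.float32) / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_q.to(torch.float8_e4m3fn), scale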
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4025637Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:31:59.4026470Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4027496Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:59.4028703Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:31:59.4029544Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:59.4030761Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4032096Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4033224Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.4034282Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:31:59.4035583Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4036946Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4038022Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4038936Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4039679Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:31:59.4040810Z W0507 20:31:59.398000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.4859302Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.4860899Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:31:59.4863576Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.4866480Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.4868578Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4871100Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.4872504Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.4873496Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4874726Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.4876118Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.4877194Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4878717Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:59.4879973Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:31:59.4881363Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.4882581Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:31:59.4883415Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.4884446Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:31:59.4885469Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:31:59.4886447Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^^^^^^^^^^^^^ 2025-05-07T20:31:59.4887657Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.4888931Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.4890051Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:31:59.4891090Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:31:59.4892277Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.4893636Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.4894695Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.4895604Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.4896349Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:31:59.4897358Z W0507 20:31:59.483000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.9997478Z self = 2025-05-07T20:31:59.9998156Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:59.9998433Z 2025-05-07T20:31:59.9998513Z @given( 2025-05-07T20:31:59.9998757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.9999070Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.9999380Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.9999718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0000046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0000341Z ) 2025-05-07T20:32:00.0000692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0001190Z def test_silu_mul_quant( 2025-05-07T20:32:00.0001435Z self, 2025-05-07T20:32:00.0001643Z T: int, 2025-05-07T20:32:00.0001853Z D: int, 2025-05-07T20:32:00.0002074Z scale_ub: Optional[float], 2025-05-07T20:32:00.0002350Z contiguous: bool, 2025-05-07T20:32:00.0002593Z compiled: bool, 2025-05-07T20:32:00.0003022Z ) -> None: 2025-05-07T20:32:00.0003248Z torch.manual_seed(2025) 2025-05-07T20:32:00.0003489Z 2025-05-07T20:32:00.0003759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0004105Z 2025-05-07T20:32:00.0004306Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0004591Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0004903Z x = x_sign * x_clamp 2025-05-07T20:32:00.0005148Z x0 = x[:, :D] 2025-05-07T20:32:00.0005358Z x1 = x[:, D:] 2025-05-07T20:32:00.0005566Z 2025-05-07T20:32:00.0005753Z if contiguous: 2025-05-07T20:32:00.0005994Z x0 = x0.contiguous() 2025-05-07T20:32:00.0006248Z x1 = x1.contiguous() 2025-05-07T20:32:00.0006608Z 2025-05-07T20:32:00.0006804Z if scale_ub is not None: 2025-05-07T20:32:00.0007070Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0007405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0007722Z ) 2025-05-07T20:32:00.0007911Z else: 2025-05-07T20:32:00.0008121Z scale_ub_tensor = None 2025-05-07T20:32:00.0008374Z 2025-05-07T20:32:00.0008600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0008914Z op = silu_mul_quant 2025-05-07T20:32:00.0009166Z if compiled: 2025-05-07T20:32:00.0009417Z op = torch.compile(op) 2025-05-07T20:32:00.0009710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0009985Z 2025-05-07T20:32:00.0010178Z y_fp8, y_scale = fn() 2025-05-07T20:32:00.0010487Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:00.0010800Z 2025-05-07T20:32:00.0011044Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0011372Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:00.0011664Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:00.0011982Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:00.0012335Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0012643Z 2025-05-07T20:32:00.0012847Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:00.0013040Z 2025-05-07T20:32:00.0013147Z moe/activation_test.py:126: 2025-05-07T20:32:00.0013437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0013772Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.0014096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0014875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:00.0015628Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.0016172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0016854Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0017530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.0018249Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0018993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:00.0019734Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0020454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.0021090Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.0021689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.0022197Z fn() 2025-05-07T20:32:00.0022796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.0023380Z self.fn.run( 2025-05-07T20:32:00.0023844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0024372Z kernel = self.compile( 2025-05-07T20:32:00.0024914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0025561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0025952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0026260Z 2025-05-07T20:32:00.0026466Z self = 2025-05-07T20:32:00.0027546Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0029135Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49843453a0>} 2025-05-07T20:32:00.0030470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0031531Z context = 2025-05-07T20:32:00.0031820Z 2025-05-07T20:32:00.0031989Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0032506Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0032970Z module_map=module_map) 2025-05-07T20:32:00.0033336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0033689Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.0033956Z E ^ 2025-05-07T20:32:00.0034415Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:00.0035279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
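The interleaved identify_mutated_tensors warnings come from torch.compile tracing the user Triton kernel: it compiles the kernel to TTIR to work out which arguments are written to, that compilation raises the same ValueError, and it falls back to assuming every input is mutated before the test failure itself surfaces. The failure reproduces without Hypothesis or torch.compile; a minimal sketch (a hypothetical kernel, assuming any cast to tl.float8e4nv trips the error during ast_to_ttir on such GPUs):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # The cast below is what needs SM 8.9+; on older GPUs compilation fails
    # with: ValueError("type fp8e4nv not supported in this architecture.
    # The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

n = 1024
x = torch.randn(n, device="cuda", dtype=torch.float32)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# Expected to raise triton.compiler.errors.CompilationError on SM < 8.9.
_cast_fp8e4nv_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)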
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.3371939Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:00.3372772Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.3373797Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:32:00.3374815Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:00.3375615Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:00.3376821Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.3378103Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.3379231Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:32:00.3380270Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:00.3381504Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.3382940Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.3384004Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.3384922Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.3385670Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:00.3386681Z W0507 20:32:00.332000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.4202061Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.4203195Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:00.4204524Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.4205935Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.4206909Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4208209Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.4209580Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.4210559Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4211831Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.4213211Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.4214267Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4215541Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.4216787Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:00.4218003Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.4219398Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:00.4220227Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.4221247Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] 
[0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 418, in visit 2025-05-07T20:32:00.4222269Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:00.4223064Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:00.4224410Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.4225682Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.4226795Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/ast.py", line 426, in generic_visit 2025-05-07T20:32:00.4227837Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:00.4229228Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.4230606Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.4231708Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.4232623Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.4233370Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:00.4234384Z W0507 20:32:00.417000 238389 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:00.7727301Z self = <...>
2025-05-07T20:32:00.7728073Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:00.7728757Z     @given(
2025-05-07T20:32:00.7729076Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.7729403Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.7729710Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.7730051Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.7730384Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.7730669Z     )
2025-05-07T20:32:00.7731020Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.7731463Z     def test_silu_mul_quant(
2025-05-07T20:32:00.7731702Z         self,
2025-05-07T20:32:00.7731901Z         T: int,
2025-05-07T20:32:00.7732103Z         D: int,
2025-05-07T20:32:00.7732323Z         scale_ub: Optional[float],
2025-05-07T20:32:00.7732600Z         contiguous: bool,
2025-05-07T20:32:00.7732838Z         compiled: bool,
2025-05-07T20:32:00.7733058Z     ) -> None:
2025-05-07T20:32:00.7733282Z         torch.manual_seed(2025)
2025-05-07T20:32:00.7733972Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.7734516Z         x_sign = torch.sign(x)
2025-05-07T20:32:00.7734810Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.7735113Z         x = x_sign * x_clamp
2025-05-07T20:32:00.7735350Z         x0 = x[:, :D]
2025-05-07T20:32:00.7735569Z         x1 = x[:, D:]
2025-05-07T20:32:00.7735952Z         if contiguous:
2025-05-07T20:32:00.7736190Z             x0 = x0.contiguous()
2025-05-07T20:32:00.7736441Z             x1 = x1.contiguous()
2025-05-07T20:32:00.7736869Z         if scale_ub is not None:
2025-05-07T20:32:00.7737251Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.7737587Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.7737897Z             )
2025-05-07T20:32:00.7738088Z         else:
2025-05-07T20:32:00.7738309Z             scale_ub_tensor = None
2025-05-07T20:32:00.7738790Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.7739097Z             op = silu_mul_quant
2025-05-07T20:32:00.7739351Z             if compiled:
2025-05-07T20:32:00.7739600Z                 op = torch.compile(op)
2025-05-07T20:32:00.7739889Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.7740361Z         y_fp8, y_scale = fn()
2025-05-07T20:32:00.7740721Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.7741542Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.7742070Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.7742522Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.7743008Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.7743575Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.7744404Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.7744831Z moe/activation_test.py:126:
2025-05-07T20:32:00.7745242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:00.7745646Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:00.7745972Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.7746758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:00.7747507Z     _kernel_quantize_fp8_row[grid](
[... same jit.py -> autotuner.py -> testing.py -> compiler.py chain as above ...]
2025-05-07T20:32:00.7765097Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.7765446Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:00.7765718Z E   ^
2025-05-07T20:32:00.7766183Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
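For reference, the contract of triton_quantize_fp8_row as exercised here is row-wise quantization: scale each row so its max magnitude fits the fp8 range, and return the per-row dequantization scale (the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None]). A plain-PyTorch sketch of that contract, assuming torch.float8_e4m3fn and ignoring the real kernel's tiling and autotuning:

import torch

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row max magnitude, optionally capped by scale_ub as in the test.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    y_scale = row_max / fp8_max                      # dequantization scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale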
2025-05-07T20:32:00.7767043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:00.7767657Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.7768069Z     self=<...>,
2025-05-07T20:32:00.7768472Z     T=16384,
2025-05-07T20:32:00.7768664Z     D=5120,
2025-05-07T20:32:00.7768866Z     scale_ub=None,
2025-05-07T20:32:00.7769091Z     contiguous=True,
2025-05-07T20:32:00.7769318Z     compiled=True,
2025-05-07T20:32:00.7769524Z )
2025-05-07T20:32:00.8021305Z W0507 20:32:00.801000 238389 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:00.8023050Z     function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:00.8024441Z     last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:00.8025441Z     To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:00.8026557Z     To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:00.8707763Z self = <...>
2025-05-07T20:32:00.8709316Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... same test source and ref_fn() traceback as the T=4096 example above; fails at moe/activation_test.py:126 with the same CompilationError from _kernel_quantize_fp8_row ...]
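The recompile-limit warning is separate from the fp8 error: Hypothesis feeds both contiguous and sliced inputs (x0 stride 10240 vs 5120), and every layout change invalidates a dynamo guard until the limit of 8 is reached and silu_mul_quant falls back to eager. A sketch of the two usual remedies; whether either is appropriate for this test is an assumption:

import torch

# 1) Compile once with dynamic shapes/strides so one graph serves both layouts:
#    op = torch.compile(silu_mul_quant, dynamic=True)

# 2) Or raise the limit for a test that legitimately varies layout this much:
torch._dynamo.config.recompile_limit = 32

# TORCH_LOGS="recompiles" (suggested in the log) prints every guard failure.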
2025-05-07T20:32:00.8752593Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError (fp8e4nv not supported)
2025-05-07T20:32:00.9845002Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:126 in ref_fn(); _kernel_quantize_fp8_row raises the same CompilationError
2025-05-07T20:32:01.2008715Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.3211539Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.3248933Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.4146641Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError
2025-05-07T20:32:01.4177376Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at moe/activation_test.py:117 in fn(); _fbgemm_silu_mul_quant raises the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5562259Z 2025-05-07T20:32:01.5562675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5563196Z 2025-05-07T20:32:01.5563305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5563725Z self=, 2025-05-07T20:32:01.5564125Z T=1, 2025-05-07T20:32:01.5564322Z D=7168, 2025-05-07T20:32:01.5564525Z scale_ub=1200.0, 2025-05-07T20:32:01.5564749Z contiguous=True, 2025-05-07T20:32:01.5564982Z compiled=True, 2025-05-07T20:32:01.5565198Z ) 2025-05-07T20:32:01.5565523Z self = 2025-05-07T20:32:01.5566008Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.5566276Z 2025-05-07T20:32:01.5566358Z @given( 2025-05-07T20:32:01.5566610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.5566923Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.5567241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.5567584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.5567910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.5568199Z ) 2025-05-07T20:32:01.5568554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.5568999Z def test_silu_mul_quant( 2025-05-07T20:32:01.5569240Z self, 2025-05-07T20:32:01.5569442Z T: int, 2025-05-07T20:32:01.5569648Z D: int, 2025-05-07T20:32:01.5569865Z scale_ub: Optional[float], 2025-05-07T20:32:01.5570141Z contiguous: bool, 2025-05-07T20:32:01.5570387Z compiled: bool, 2025-05-07T20:32:01.5570606Z ) -> None: 2025-05-07T20:32:01.5570826Z torch.manual_seed(2025) 2025-05-07T20:32:01.5571080Z 2025-05-07T20:32:01.5571386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.5571750Z 2025-05-07T20:32:01.5571949Z x_sign = torch.sign(x) 2025-05-07T20:32:01.5572322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.5572640Z x = x_sign * x_clamp 2025-05-07T20:32:01.5572885Z x0 = x[:, :D] 2025-05-07T20:32:01.5573100Z x1 = x[:, D:] 2025-05-07T20:32:01.5573307Z 2025-05-07T20:32:01.5573501Z if contiguous: 2025-05-07T20:32:01.5573729Z x0 = x0.contiguous() 2025-05-07T20:32:01.5573992Z x1 = x1.contiguous() 2025-05-07T20:32:01.5574238Z 2025-05-07T20:32:01.5574431Z if scale_ub is not None: 2025-05-07T20:32:01.5574700Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.5575037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.5575349Z ) 2025-05-07T20:32:01.5575591Z else: 2025-05-07T20:32:01.5584576Z scale_ub_tensor = None 2025-05-07T20:32:01.5584886Z 2025-05-07T20:32:01.5585143Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.5585468Z op = silu_mul_quant 2025-05-07T20:32:01.5585731Z if compiled: 2025-05-07T20:32:01.5585999Z op = torch.compile(op) 2025-05-07T20:32:01.5586313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5586593Z 2025-05-07T20:32:01.5586802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.5586972Z 2025-05-07T20:32:01.5587087Z moe/activation_test.py:117: 2025-05-07T20:32:01.5587384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5587728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.5588023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.5588585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.5589231Z return fn(*args, **kwargs) 
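Note on the recurring failure: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, which is only implemented for NVIDIA GPUs of compute capability 8.9 and newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner is SM 8.6, where Triton exposes only fp8e4b15 and fp8e5, so every kernel touching the e4m3 dtype fails at compile time with the ValueError shown above. A minimal capability-guard sketch (gpu_supports_fp8_e4m3 is a hypothetical helper, not part of the test file):

    import unittest

    import torch

    def gpu_supports_fp8_e4m3() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) kernels need SM 8.9+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the CompilationError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not gpu_supports_fp8_e4m3(), "fp8 e4m3 unsupported on this GPU")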
2025-05-07T20:32:01.5589904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.5590598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.5591187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.5591874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.5592540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.5593069Z kernel = self.compile( 2025-05-07T20:32:01.5593619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.5594281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.5594695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.5594932Z 2025-05-07T20:32:01.5595139Z self = 2025-05-07T20:32:01.5596233Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.5597617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4984e20ea0>} 2025-05-07T20:32:01.5598975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.5600002Z context = 2025-05-07T20:32:01.5600290Z 2025-05-07T20:32:01.5600460Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.5601012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.5601623Z module_map=module_map) 2025-05-07T20:32:01.5601993Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.5602354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.5602621Z E ^ 2025-05-07T20:32:01.5603092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.5603540Z 2025-05-07T20:32:01.5603955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.5604476Z 2025-05-07T20:32:01.5604597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.5605019Z self=, 2025-05-07T20:32:01.5605514Z T=1, 2025-05-07T20:32:01.5605704Z D=7168, 2025-05-07T20:32:01.5605908Z scale_ub=1200.0, 2025-05-07T20:32:01.5606146Z contiguous=False, 2025-05-07T20:32:01.5606374Z compiled=True, 2025-05-07T20:32:01.5606588Z ) 2025-05-07T20:32:01.8290797Z self = 2025-05-07T20:32:01.8291310Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.8291889Z 2025-05-07T20:32:01.8292139Z @given( 2025-05-07T20:32:01.8292706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8293339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8293942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.8294607Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.8295268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.8295833Z ) 2025-05-07T20:32:01.8296526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.8297411Z def test_silu_mul_quant( 2025-05-07T20:32:01.8297899Z self, 2025-05-07T20:32:01.8298283Z T: int, 2025-05-07T20:32:01.8298676Z D: int, 2025-05-07T20:32:01.8299123Z scale_ub: Optional[float], 2025-05-07T20:32:01.8299654Z contiguous: bool, 2025-05-07T20:32:01.8300133Z compiled: bool, 2025-05-07T20:32:01.8300586Z ) -> None: 2025-05-07T20:32:01.8300979Z torch.manual_seed(2025) 2025-05-07T20:32:01.8301228Z 2025-05-07T20:32:01.8301514Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.8301858Z 2025-05-07T20:32:01.8302060Z x_sign = torch.sign(x) 2025-05-07T20:32:01.8302360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.8302669Z x = x_sign * x_clamp 2025-05-07T20:32:01.8302916Z x0 = x[:, :D] 2025-05-07T20:32:01.8303151Z x1 = x[:, D:] 2025-05-07T20:32:01.8303371Z 2025-05-07T20:32:01.8303561Z if contiguous: 2025-05-07T20:32:01.8303806Z x0 = x0.contiguous() 2025-05-07T20:32:01.8304071Z x1 = x1.contiguous() 2025-05-07T20:32:01.8304310Z 2025-05-07T20:32:01.8304510Z if scale_ub is not None: 2025-05-07T20:32:01.8304792Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.8305126Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.8305444Z ) 2025-05-07T20:32:01.8305648Z else: 2025-05-07T20:32:01.8305856Z scale_ub_tensor = None 2025-05-07T20:32:01.8306115Z 2025-05-07T20:32:01.8306358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.8306670Z op = silu_mul_quant 2025-05-07T20:32:01.8306928Z if compiled: 2025-05-07T20:32:01.8307184Z op = torch.compile(op) 2025-05-07T20:32:01.8307481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8307763Z 2025-05-07T20:32:01.8307970Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.8308137Z 2025-05-07T20:32:01.8308252Z moe/activation_test.py:117: 2025-05-07T20:32:01.8308547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8309204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.8309502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.8310069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.8310631Z return fn(*args, **kwargs) 
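For reference, the op under test computes a SiLU-gated product, y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, fused with rowwise fp8 quantization of the result. A plain-PyTorch sketch of the unquantized part, mirroring the test's own ref_fn (the quantization step is where compilation fails on this GPU):

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in fp32, exactly as ref_fn computes before quantizing.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32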
2025-05-07T20:32:01.8311299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.8311981Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.8312525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.8313297Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.8314028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.8314559Z kernel = self.compile( 2025-05-07T20:32:01.8315110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.8315769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.8316163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.8316396Z 2025-05-07T20:32:01.8316602Z self = 2025-05-07T20:32:01.8317686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.8319063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4d1c0>} 2025-05-07T20:32:01.8320406Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.8321462Z context = 2025-05-07T20:32:01.8321766Z 2025-05-07T20:32:01.8321936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.8322452Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.8322917Z module_map=module_map) 2025-05-07T20:32:01.8323277Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.8323638Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.8323910Z E ^ 2025-05-07T20:32:01.8324370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.8324824Z 2025-05-07T20:32:01.8325245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.8325763Z 2025-05-07T20:32:01.8325867Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.8326284Z self=, 2025-05-07T20:32:01.8326683Z T=1, 2025-05-07T20:32:01.8326877Z D=7168, 2025-05-07T20:32:01.8327077Z scale_ub=None, 2025-05-07T20:32:01.8327295Z contiguous=False, 2025-05-07T20:32:01.8327523Z compiled=True, 2025-05-07T20:32:01.8327734Z ) 2025-05-07T20:32:01.8996605Z self = 2025-05-07T20:32:01.8997105Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.8997436Z 2025-05-07T20:32:01.8997518Z @given( 2025-05-07T20:32:01.8998222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.8999126Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.8999958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9001093Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9001578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9001886Z ) 2025-05-07T20:32:01.9002241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9002689Z def test_silu_mul_quant( 2025-05-07T20:32:01.9002937Z self, 2025-05-07T20:32:01.9003136Z T: int, 2025-05-07T20:32:01.9003344Z D: int, 2025-05-07T20:32:01.9003573Z scale_ub: Optional[float], 2025-05-07T20:32:01.9003844Z contiguous: bool, 2025-05-07T20:32:01.9004096Z compiled: bool, 2025-05-07T20:32:01.9004397Z ) -> None: 2025-05-07T20:32:01.9004683Z torch.manual_seed(2025) 2025-05-07T20:32:01.9004937Z 2025-05-07T20:32:01.9005226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9005567Z 2025-05-07T20:32:01.9005772Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9006077Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9006387Z x = x_sign * x_clamp 2025-05-07T20:32:01.9006635Z x0 = x[:, :D] 2025-05-07T20:32:01.9006856Z x1 = x[:, D:] 2025-05-07T20:32:01.9007064Z 2025-05-07T20:32:01.9007259Z if contiguous: 2025-05-07T20:32:01.9007491Z x0 = x0.contiguous() 2025-05-07T20:32:01.9007744Z x1 = x1.contiguous() 2025-05-07T20:32:01.9007985Z 2025-05-07T20:32:01.9008179Z if scale_ub is not None: 2025-05-07T20:32:01.9008451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9008801Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9009108Z ) 2025-05-07T20:32:01.9009314Z else: 2025-05-07T20:32:01.9009530Z scale_ub_tensor = None 2025-05-07T20:32:01.9009777Z 2025-05-07T20:32:01.9010015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9010332Z op = silu_mul_quant 2025-05-07T20:32:01.9010584Z if compiled: 2025-05-07T20:32:01.9010836Z op = torch.compile(op) 2025-05-07T20:32:01.9011139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9011408Z 2025-05-07T20:32:01.9011607Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9011897Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9012206Z 2025-05-07T20:32:01.9012443Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9012784Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9013084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9013392Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9013757Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9014070Z 2025-05-07T20:32:01.9014266Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:01.9014467Z 2025-05-07T20:32:01.9014569Z moe/activation_test.py:126: 2025-05-07T20:32:01.9014874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9015216Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9015538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9016328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9017087Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9017627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9018309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9018997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9019800Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9020541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:01.9021314Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9022062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9022699Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9023292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9023812Z fn() 2025-05-07T20:32:01.9024366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9024977Z self.fn.run( 2025-05-07T20:32:01.9025441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9025973Z kernel = self.compile( 2025-05-07T20:32:01.9026516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9027160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9027565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9027790Z 2025-05-07T20:32:01.9028003Z self = 2025-05-07T20:32:01.9029472Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9030838Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4899e4dda0>} 2025-05-07T20:32:01.9032231Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9033251Z context = 2025-05-07T20:32:01.9033535Z 2025-05-07T20:32:01.9033708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9034223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9034690Z module_map=module_map) 2025-05-07T20:32:01.9035057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9035417Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9035680Z E ^ 2025-05-07T20:32:01.9036144Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9036593Z 2025-05-07T20:32:01.9037012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9037521Z 2025-05-07T20:32:01.9037635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9038040Z self=, 2025-05-07T20:32:01.9038444Z T=1, 2025-05-07T20:32:01.9038632Z D=5120, 2025-05-07T20:32:01.9038824Z scale_ub=1200.0, 2025-05-07T20:32:01.9039055Z contiguous=False, 2025-05-07T20:32:01.9039286Z compiled=True, 2025-05-07T20:32:01.9039486Z ) 2025-05-07T20:32:02.0227476Z self = 2025-05-07T20:32:02.0228421Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0228799Z 2025-05-07T20:32:02.0228923Z @given( 2025-05-07T20:32:02.0229292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0229942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0230381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0230803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0231130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0231415Z ) 2025-05-07T20:32:02.0231768Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0232204Z def test_silu_mul_quant( 2025-05-07T20:32:02.0232456Z self, 2025-05-07T20:32:02.0232654Z T: int, 2025-05-07T20:32:02.0232850Z D: int, 2025-05-07T20:32:02.0233075Z scale_ub: Optional[float], 2025-05-07T20:32:02.0233429Z contiguous: bool, 2025-05-07T20:32:02.0233724Z compiled: bool, 2025-05-07T20:32:02.0233952Z ) -> None: 2025-05-07T20:32:02.0234171Z torch.manual_seed(2025) 2025-05-07T20:32:02.0234407Z 2025-05-07T20:32:02.0234700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0235050Z 2025-05-07T20:32:02.0235248Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0235541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0235856Z x = x_sign * x_clamp 2025-05-07T20:32:02.0236103Z x0 = x[:, :D] 2025-05-07T20:32:02.0236324Z x1 = x[:, D:] 2025-05-07T20:32:02.0236536Z 2025-05-07T20:32:02.0236726Z if contiguous: 2025-05-07T20:32:02.0236957Z x0 = x0.contiguous() 2025-05-07T20:32:02.0237221Z x1 = x1.contiguous() 2025-05-07T20:32:02.0237472Z 2025-05-07T20:32:02.0237664Z if scale_ub is not None: 2025-05-07T20:32:02.0237940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0238288Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0238593Z ) 2025-05-07T20:32:02.0238800Z else: 2025-05-07T20:32:02.0239017Z scale_ub_tensor = None 2025-05-07T20:32:02.0239267Z 2025-05-07T20:32:02.0239508Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0239826Z op = silu_mul_quant 2025-05-07T20:32:02.0240075Z if compiled: 
2025-05-07T20:32:02.0240326Z op = torch.compile(op) 2025-05-07T20:32:02.0240633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0240910Z 2025-05-07T20:32:02.0241103Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0241276Z 2025-05-07T20:32:02.0241377Z moe/activation_test.py:117: 2025-05-07T20:32:02.0241675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0242003Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0242288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0242852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0243411Z return fn(*args, **kwargs) 2025-05-07T20:32:02.0244070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.0244758Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.0245293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.0245965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.0246627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.0247153Z kernel = self.compile( 2025-05-07T20:32:02.0247689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.0248337Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.0248739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0248966Z 2025-05-07T20:32:02.0249295Z self = 2025-05-07T20:32:02.0250368Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.0251721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4e020>} 2025-05-07T20:32:02.0253054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.0254176Z context = 2025-05-07T20:32:02.0254460Z 2025-05-07T20:32:02.0254629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.0255145Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.0255610Z module_map=module_map) 2025-05-07T20:32:02.0255973Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.0256323Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.0256576Z E ^ 2025-05-07T20:32:02.0257039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.0257485Z 2025-05-07T20:32:02.0257912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.0258424Z 2025-05-07T20:32:02.0258527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.0258972Z self=, 2025-05-07T20:32:02.0259381Z T=1, 2025-05-07T20:32:02.0259571Z D=5120, 2025-05-07T20:32:02.0259772Z scale_ub=1200.0, 2025-05-07T20:32:02.0259992Z contiguous=False, 2025-05-07T20:32:02.0260223Z compiled=False, 2025-05-07T20:32:02.0260430Z ) 2025-05-07T20:32:02.0260746Z self = 2025-05-07T20:32:02.0261240Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.0261512Z 2025-05-07T20:32:02.0261619Z @given( 2025-05-07T20:32:02.0261870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0262182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0262491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0262815Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0263143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0263428Z ) 2025-05-07T20:32:02.0263777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0264213Z def test_silu_mul_quant( 2025-05-07T20:32:02.0264455Z self, 2025-05-07T20:32:02.0264652Z T: int, 2025-05-07T20:32:02.0264842Z D: int, 2025-05-07T20:32:02.0265060Z scale_ub: Optional[float], 2025-05-07T20:32:02.0265330Z contiguous: bool, 2025-05-07T20:32:02.0265564Z compiled: bool, 2025-05-07T20:32:02.0265785Z ) -> None: 2025-05-07T20:32:02.0265999Z torch.manual_seed(2025) 2025-05-07T20:32:02.0266232Z 2025-05-07T20:32:02.0266502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0266840Z 2025-05-07T20:32:02.0267027Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0267317Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0267626Z x = x_sign * x_clamp 2025-05-07T20:32:02.0267860Z x0 = x[:, :D] 2025-05-07T20:32:02.0268081Z x1 = x[:, D:] 2025-05-07T20:32:02.0268295Z 2025-05-07T20:32:02.0268474Z if contiguous: 2025-05-07T20:32:02.0268791Z x0 = x0.contiguous() 2025-05-07T20:32:02.0269050Z x1 = x1.contiguous() 2025-05-07T20:32:02.0269364Z 2025-05-07T20:32:02.0269550Z if scale_ub is not None: 2025-05-07T20:32:02.0269823Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0270156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0270458Z ) 2025-05-07T20:32:02.0270652Z else: 2025-05-07T20:32:02.0270860Z scale_ub_tensor = None 2025-05-07T20:32:02.0271107Z 2025-05-07T20:32:02.0271359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0271706Z op = silu_mul_quant 2025-05-07T20:32:02.0272068Z if compiled: 2025-05-07T20:32:02.0272446Z op = torch.compile(op) 2025-05-07T20:32:02.0272986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0273365Z 2025-05-07T20:32:02.0273669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0273845Z 2025-05-07T20:32:02.0274035Z moe/activation_test.py:117: 2025-05-07T20:32:02.0274507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0274946Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0275359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0276092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.0276881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.0277554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.0278282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.0287575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.0288214Z kernel = self.compile( 2025-05-07T20:32:02.0288793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.0289464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.0289873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0290105Z 2025-05-07T20:32:02.0290313Z self = 2025-05-07T20:32:02.0291415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.0292807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbc720>} 2025-05-07T20:32:02.0294175Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.0295196Z context = 2025-05-07T20:32:02.0295492Z 2025-05-07T20:32:02.0295659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.0296185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.0296660Z module_map=module_map) 2025-05-07T20:32:02.0297024Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.0297384Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.0297653Z E ^ 2025-05-07T20:32:02.0298124Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.0298587Z 2025-05-07T20:32:02.0299125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.0299653Z 2025-05-07T20:32:02.0299765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.0300188Z self=, 2025-05-07T20:32:02.0300590Z T=16384, 2025-05-07T20:32:02.0300793Z D=5120, 2025-05-07T20:32:02.0301005Z scale_ub=1200.0, 2025-05-07T20:32:02.0301274Z contiguous=False, 2025-05-07T20:32:02.0301517Z compiled=True, 2025-05-07T20:32:02.0301727Z ) 2025-05-07T20:32:02.0983531Z self = 2025-05-07T20:32:02.0984108Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.0984615Z 2025-05-07T20:32:02.0984774Z @given( 2025-05-07T20:32:02.0985018Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.0985341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.0985650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.0985996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.0986332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.0986616Z ) 2025-05-07T20:32:02.0986978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.0987439Z def test_silu_mul_quant( 2025-05-07T20:32:02.0987679Z self, 2025-05-07T20:32:02.0987885Z T: int, 2025-05-07T20:32:02.0988092Z D: int, 2025-05-07T20:32:02.0988323Z scale_ub: Optional[float], 2025-05-07T20:32:02.0988595Z contiguous: bool, 2025-05-07T20:32:02.0988828Z compiled: bool, 2025-05-07T20:32:02.0989136Z ) -> None: 2025-05-07T20:32:02.0989364Z torch.manual_seed(2025) 2025-05-07T20:32:02.0989614Z 2025-05-07T20:32:02.0989901Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.0990246Z 2025-05-07T20:32:02.0990453Z x_sign = torch.sign(x) 2025-05-07T20:32:02.0990755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.0991077Z x = x_sign * x_clamp 2025-05-07T20:32:02.0991361Z x0 = x[:, :D] 2025-05-07T20:32:02.0991581Z x1 = x[:, D:] 2025-05-07T20:32:02.0991788Z 2025-05-07T20:32:02.0991981Z if contiguous: 2025-05-07T20:32:02.0992218Z x0 = x0.contiguous() 2025-05-07T20:32:02.0992473Z x1 = x1.contiguous() 2025-05-07T20:32:02.0992720Z 2025-05-07T20:32:02.0992921Z if scale_ub is not None: 2025-05-07T20:32:02.0993190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.0993536Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.0993850Z ) 2025-05-07T20:32:02.0994055Z else: 2025-05-07T20:32:02.0994266Z scale_ub_tensor = None 2025-05-07T20:32:02.0994524Z 2025-05-07T20:32:02.0994761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.0995072Z op = silu_mul_quant 2025-05-07T20:32:02.0995327Z if compiled: 2025-05-07T20:32:02.0995577Z op = torch.compile(op) 2025-05-07T20:32:02.0995868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0996153Z 2025-05-07T20:32:02.0996350Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.0996517Z 2025-05-07T20:32:02.0996619Z moe/activation_test.py:117: 2025-05-07T20:32:02.0996917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.0997249Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.0997532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.0998084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.0998650Z return fn(*args, **kwargs) 
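The (y_fp8, y_scale) pair is rowwise-quantized: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], i.e. one fp32 scale per row. A hedged emulation of what triton_quantize_fp8_row appears to compute, inferred from that dequantization and the scale_ub argument (the real kernel's epsilon handling and rounding may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dequantization scale: y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale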
2025-05-07T20:32:02.0999311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1000120Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1000659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1001338Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1002043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1002571Z kernel = self.compile( 2025-05-07T20:32:02.1003121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1003781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1004262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1004498Z 2025-05-07T20:32:02.1004706Z self = 2025-05-07T20:32:02.1005791Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1007160Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbdd00>} 2025-05-07T20:32:02.1008495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1009507Z context = 2025-05-07T20:32:02.1009803Z 2025-05-07T20:32:02.1009971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1010492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1010963Z module_map=module_map) 2025-05-07T20:32:02.1011332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1011689Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1011955Z E ^ 2025-05-07T20:32:02.1012413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1012868Z 2025-05-07T20:32:02.1013283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1013798Z 2025-05-07T20:32:02.1013904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1014320Z self=, 2025-05-07T20:32:02.1014719Z T=2048, 2025-05-07T20:32:02.1014914Z D=7168, 2025-05-07T20:32:02.1015117Z scale_ub=1200.0, 2025-05-07T20:32:02.1015342Z contiguous=False, 2025-05-07T20:32:02.1015592Z compiled=True, 2025-05-07T20:32:02.1015805Z ) 2025-05-07T20:32:02.1016127Z self = 2025-05-07T20:32:02.1016617Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:02.1016891Z 2025-05-07T20:32:02.1016971Z @given( 2025-05-07T20:32:02.1017205Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1017518Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1017819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1018148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1018473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1018756Z ) 2025-05-07T20:32:02.1019109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1019547Z def test_silu_mul_quant( 2025-05-07T20:32:02.1019784Z self, 2025-05-07T20:32:02.1019986Z T: int, 2025-05-07T20:32:02.1020275Z D: int, 2025-05-07T20:32:02.1020498Z scale_ub: Optional[float], 2025-05-07T20:32:02.1020769Z contiguous: bool, 2025-05-07T20:32:02.1021012Z compiled: bool, 2025-05-07T20:32:02.1021244Z ) -> None: 2025-05-07T20:32:02.1021461Z torch.manual_seed(2025) 2025-05-07T20:32:02.1021704Z 2025-05-07T20:32:02.1021980Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1022318Z 2025-05-07T20:32:02.1022517Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1022810Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1023116Z x = x_sign * x_clamp 2025-05-07T20:32:02.1023401Z x0 = x[:, :D] 2025-05-07T20:32:02.1023688Z x1 = x[:, D:] 2025-05-07T20:32:02.1023890Z 2025-05-07T20:32:02.1024080Z if contiguous: 2025-05-07T20:32:02.1024314Z x0 = x0.contiguous() 2025-05-07T20:32:02.1024569Z x1 = x1.contiguous() 2025-05-07T20:32:02.1024810Z 2025-05-07T20:32:02.1025014Z if scale_ub is not None: 2025-05-07T20:32:02.1025282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1025621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1025927Z ) 2025-05-07T20:32:02.1026122Z else: 2025-05-07T20:32:02.1026339Z scale_ub_tensor = None 2025-05-07T20:32:02.1026590Z 2025-05-07T20:32:02.1026825Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1027136Z op = silu_mul_quant 2025-05-07T20:32:02.1027390Z if compiled: 2025-05-07T20:32:02.1027639Z op = torch.compile(op) 2025-05-07T20:32:02.1027934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1028445Z 2025-05-07T20:32:02.1028669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1028835Z 2025-05-07T20:32:02.1028934Z moe/activation_test.py:117: 2025-05-07T20:32:02.1029272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1029602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1029879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1030435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:02.1030989Z return fn(*args, **kwargs) 
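In the ref_fn failure further up, the error surfaces through triton.runtime.autotuner rather than a direct launch: _kernel_quantize_fp8_row appears to be autotuned, so Autotuner.run benchmarks every pruned config via do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)), and each benchmark run JIT-compiles the kernel, which is where the dtype check throws. A minimal sketch of that benchmarking call (requires a CUDA device; kernel_call is a stand-in):

    import triton.testing

    def kernel_call() -> None:
        # Stand-in for Autotuner.kernel_call, which invokes self.fn.run(...)
        # and therefore compiles the kernel inside the benchmark itself.
        pass

    # Median / p20 / p80 timings, matching autotuner.py:166 in the traceback:
    t_med, t_p20, t_p80 = triton.testing.do_bench(
        kernel_call, quantiles=(0.5, 0.2, 0.8)
    )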
2025-05-07T20:32:02.1031640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1032320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1032851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1033531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1034186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1034709Z kernel = self.compile( 2025-05-07T20:32:02.1035249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1035900Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1036287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1036519Z 2025-05-07T20:32:02.1036724Z self = 2025-05-07T20:32:02.1037797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1039155Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbe840>} 2025-05-07T20:32:02.1040622Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1041645Z context = 2025-05-07T20:32:02.1041929Z 2025-05-07T20:32:02.1042093Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1042609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1043077Z module_map=module_map) 2025-05-07T20:32:02.1043444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1043851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1044173Z E ^ 2025-05-07T20:32:02.1044635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1045080Z 2025-05-07T20:32:02.1045498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1046012Z 2025-05-07T20:32:02.1939765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1940246Z self=, 2025-05-07T20:32:02.1940787Z T=1, 2025-05-07T20:32:02.1940972Z D=5120, 2025-05-07T20:32:02.1941234Z scale_ub=None, 2025-05-07T20:32:02.1941539Z contiguous=False, 2025-05-07T20:32:02.1941846Z compiled=False, 2025-05-07T20:32:02.1942125Z ) 2025-05-07T20:32:02.1942518Z self = 2025-05-07T20:32:02.1943016Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:02.1943278Z 2025-05-07T20:32:02.1943357Z @given( 2025-05-07T20:32:02.1943591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1943905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1944211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1944540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1944873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1945152Z ) 2025-05-07T20:32:02.1945499Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1945935Z def test_silu_mul_quant( 2025-05-07T20:32:02.1946179Z self, 2025-05-07T20:32:02.1946370Z T: int, 2025-05-07T20:32:02.1946574Z D: int, 2025-05-07T20:32:02.1946792Z scale_ub: Optional[float], 2025-05-07T20:32:02.1947055Z contiguous: bool, 2025-05-07T20:32:02.1947294Z compiled: bool, 2025-05-07T20:32:02.1947521Z ) -> None: 2025-05-07T20:32:02.1947731Z torch.manual_seed(2025) 2025-05-07T20:32:02.1947970Z 2025-05-07T20:32:02.1948243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1948574Z 2025-05-07T20:32:02.1948781Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1949135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1949440Z x = x_sign * x_clamp 2025-05-07T20:32:02.1949675Z x0 = x[:, :D] 2025-05-07T20:32:02.1949898Z x1 = x[:, D:] 2025-05-07T20:32:02.1950103Z 2025-05-07T20:32:02.1950289Z if contiguous: 2025-05-07T20:32:02.1950521Z x0 = x0.contiguous() 2025-05-07T20:32:02.1950778Z x1 = x1.contiguous() 2025-05-07T20:32:02.1951009Z 2025-05-07T20:32:02.1951202Z if scale_ub is not None: 2025-05-07T20:32:02.1951474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1951803Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1952117Z ) 2025-05-07T20:32:02.1952308Z else: 2025-05-07T20:32:02.1952513Z scale_ub_tensor = None 2025-05-07T20:32:02.1952764Z 2025-05-07T20:32:02.1952993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1953501Z op = silu_mul_quant 2025-05-07T20:32:02.1953762Z if compiled: 2025-05-07T20:32:02.1954008Z op = torch.compile(op) 2025-05-07T20:32:02.1954307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1954577Z 2025-05-07T20:32:02.1954766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1954932Z 2025-05-07T20:32:02.1955030Z moe/activation_test.py:117: 2025-05-07T20:32:02.1955325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1955658Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1955933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1956676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.1957421Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1957958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1958630Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1959284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1959807Z kernel = self.compile( 2025-05-07T20:32:02.1960350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1960999Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1961395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1961622Z 2025-05-07T20:32:02.1961833Z self = 2025-05-07T20:32:02.1962913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1964262Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48991640e0>} 2025-05-07T20:32:02.1965591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1966611Z context = 2025-05-07T20:32:02.1966894Z 2025-05-07T20:32:02.1967058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1967581Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1968044Z module_map=module_map) 2025-05-07T20:32:02.1968404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1968751Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1969006Z E ^ 2025-05-07T20:32:02.1969464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.1969912Z 2025-05-07T20:32:02.1970324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.1970836Z 2025-05-07T20:32:02.1970939Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.1971369Z self=, 2025-05-07T20:32:02.1971800Z T=4096, 2025-05-07T20:32:02.1971987Z D=7168, 2025-05-07T20:32:02.1972184Z scale_ub=1200.0, 2025-05-07T20:32:02.1972405Z contiguous=False, 2025-05-07T20:32:02.1972627Z compiled=False, 2025-05-07T20:32:02.1972832Z ) 2025-05-07T20:32:02.1973151Z self = 2025-05-07T20:32:02.1973770Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:02.1974051Z 2025-05-07T20:32:02.1974131Z @given( 2025-05-07T20:32:02.1974365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.1974670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.1974972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.1975299Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.1975621Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.1975899Z ) 2025-05-07T20:32:02.1976243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.1976761Z def test_silu_mul_quant( 2025-05-07T20:32:02.1976998Z self, 2025-05-07T20:32:02.1977196Z T: int, 2025-05-07T20:32:02.1977390Z D: int, 2025-05-07T20:32:02.1977601Z scale_ub: Optional[float], 2025-05-07T20:32:02.1977871Z contiguous: bool, 2025-05-07T20:32:02.1978116Z compiled: bool, 2025-05-07T20:32:02.1978332Z ) -> None: 2025-05-07T20:32:02.1978548Z torch.manual_seed(2025) 2025-05-07T20:32:02.1978791Z 2025-05-07T20:32:02.1979054Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.1979392Z 2025-05-07T20:32:02.1979585Z x_sign = torch.sign(x) 2025-05-07T20:32:02.1979873Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.1980181Z x = x_sign * x_clamp 2025-05-07T20:32:02.1980420Z x0 = x[:, :D] 2025-05-07T20:32:02.1980633Z x1 = x[:, D:] 2025-05-07T20:32:02.1980833Z 2025-05-07T20:32:02.1981021Z if contiguous: 2025-05-07T20:32:02.1981252Z x0 = x0.contiguous() 2025-05-07T20:32:02.1981504Z x1 = x1.contiguous() 2025-05-07T20:32:02.1981738Z 2025-05-07T20:32:02.1981927Z if scale_ub is not None: 2025-05-07T20:32:02.1982192Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.1982526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.1982832Z ) 2025-05-07T20:32:02.1983021Z else: 2025-05-07T20:32:02.1983231Z scale_ub_tensor = None 2025-05-07T20:32:02.1983484Z 2025-05-07T20:32:02.1983708Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.1984026Z op = silu_mul_quant 2025-05-07T20:32:02.1984278Z if compiled: 2025-05-07T20:32:02.1984520Z op = torch.compile(op) 2025-05-07T20:32:02.1984809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1985080Z 2025-05-07T20:32:02.1985265Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.1985435Z 2025-05-07T20:32:02.1985536Z moe/activation_test.py:117: 2025-05-07T20:32:02.1985826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1986154Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.1986429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.1987123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:02.1987805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.1988326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.1989005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.1989731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.1990259Z kernel = self.compile( 2025-05-07T20:32:02.1990793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.1991446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.1991926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.1992153Z 2025-05-07T20:32:02.1992359Z self = 2025-05-07T20:32:02.1993421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.1994779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899165300>} 2025-05-07T20:32:02.1996102Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.1997231Z context = 2025-05-07T20:32:02.1997513Z 2025-05-07T20:32:02.1997686Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.1998198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.1998666Z module_map=module_map) 2025-05-07T20:32:02.1999035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.1999377Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.1999642Z E ^ 2025-05-07T20:32:02.2000104Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:02.2000547Z E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.2000968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.2001528Z 
2025-05-07T20:32:02.2001631Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.2002034Z     self=,
2025-05-07T20:32:02.2002445Z     T=16384,
2025-05-07T20:32:02.2002638Z     D=7168,
2025-05-07T20:32:02.2002830Z     scale_ub=None,
2025-05-07T20:32:02.2003043Z     contiguous=True,
2025-05-07T20:32:02.2003257Z     compiled=True,
2025-05-07T20:32:02.2003457Z )
2025-05-07T20:32:02.5114580Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.5115459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
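Every example in this run fails identically: Triton refuses to lower _fbgemm_silu_mul_quant because the fp8e4nv (FP8 E4M3) dtype is not implemented for this GPU. The g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), and the error indicates only fp8e4b15 and fp8e5 are available there; fp8e4nv is generally assumed to need SM 8.9 (Ada) or newer. Below is a minimal sketch, under those assumptions, of a capability gate that would skip rather than fail such tests; the helper name, the threshold, and the class are illustrative, not FBGEMM's actual CI logic.

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (CUDA e4m3) needs compute capability >= (8, 9),
        # i.e. Ada or Hopper. The A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):  # illustrative container
        @unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # the Hypothesis-driven body shown in the log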
2025-05-07T20:32:02.5116083Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.5116494Z     self=,
2025-05-07T20:32:02.5116903Z     T=4096,
2025-05-07T20:32:02.5117103Z     D=5120,
2025-05-07T20:32:02.5117300Z     scale_ub=None,
2025-05-07T20:32:02.5117529Z     contiguous=False,
2025-05-07T20:32:02.5117806Z     compiled=True,
2025-05-07T20:32:02.5118053Z )
2025-05-07T20:32:02.5147731Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.5148607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.6289124Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.6289852Z     self=,
2025-05-07T20:32:02.6290414Z     T=4096,
2025-05-07T20:32:02.6290678Z     D=5120,
2025-05-07T20:32:02.6290918Z     scale_ub=1200.0,
2025-05-07T20:32:02.6291144Z     contiguous=False,
2025-05-07T20:32:02.6291377Z     compiled=False,
2025-05-07T20:32:02.6291595Z )
2025-05-07T20:32:02.6319884Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.6320758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.6321411Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.6321846Z     self=,
2025-05-07T20:32:02.6322256Z     T=4096,
2025-05-07T20:32:02.6322450Z     D=5120,
2025-05-07T20:32:02.6322643Z     scale_ub=1200.0,
2025-05-07T20:32:02.6322876Z     contiguous=False,
2025-05-07T20:32:02.6323106Z     compiled=True,
2025-05-07T20:32:02.6323311Z )
2025-05-07T20:32:02.6352579Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.6353446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:02.7234470Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:02.7235051Z     self=,
2025-05-07T20:32:02.7235675Z     T=2048,
2025-05-07T20:32:02.7235958Z     D=7168,
2025-05-07T20:32:02.7236427Z     scale_ub=1200.0,
2025-05-07T20:32:02.7236747Z     contiguous=False,
2025-05-07T20:32:02.7237061Z     compiled=False,
2025-05-07T20:32:02.7237295Z )
2025-05-07T20:32:02.7272757Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:02.7273642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
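From the test body alone, silu_mul_quant takes the two bfloat16 halves of x plus an optional float32 scale upper bound and returns a (y_fp8, y_scale) pair. A hedged PyTorch reference of that contract is sketched below; the row-wise scaling scheme, the E4M3 max of 448, and all names are inferred for illustration and are not FBGEMM's implementation.

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # finite max of float8_e4m3fn; an assumption of this sketch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row scale from the absolute max, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max / FP8_MAX
        # Saturate before casting: float8_e4m3fn has no inf to absorb overflow.
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)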
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7273205Z 2025-05-07T20:32:02.7273642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7274155Z 2025-05-07T20:32:02.7274263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7274690Z self=, 2025-05-07T20:32:02.7275101Z T=1, 2025-05-07T20:32:02.7275288Z D=7168, 2025-05-07T20:32:02.7275495Z scale_ub=None, 2025-05-07T20:32:02.7275716Z contiguous=True, 2025-05-07T20:32:02.7275938Z compiled=False, 2025-05-07T20:32:02.7276162Z ) 2025-05-07T20:32:02.7276498Z self = 2025-05-07T20:32:02.7276976Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:02.7277253Z 2025-05-07T20:32:02.7277334Z @given( 2025-05-07T20:32:02.7277578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:02.7277902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:02.7278211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:02.7278569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:02.7278905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:02.7279199Z ) 2025-05-07T20:32:02.7279546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:02.7279992Z def test_silu_mul_quant( 2025-05-07T20:32:02.7280241Z self, 2025-05-07T20:32:02.7280437Z T: int, 2025-05-07T20:32:02.7280639Z D: int, 2025-05-07T20:32:02.7280869Z scale_ub: Optional[float], 2025-05-07T20:32:02.7281137Z contiguous: bool, 2025-05-07T20:32:02.7281403Z compiled: bool, 2025-05-07T20:32:02.7281668Z ) -> None: 2025-05-07T20:32:02.7281884Z torch.manual_seed(2025) 2025-05-07T20:32:02.7282130Z 2025-05-07T20:32:02.7282414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:02.7282760Z 2025-05-07T20:32:02.7282958Z x_sign = torch.sign(x) 2025-05-07T20:32:02.7283358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:02.7283676Z x = x_sign * x_clamp 2025-05-07T20:32:02.7283922Z x0 = x[:, :D] 2025-05-07T20:32:02.7284150Z x1 = x[:, D:] 2025-05-07T20:32:02.7284369Z 2025-05-07T20:32:02.7284559Z if contiguous: 2025-05-07T20:32:02.7284798Z x0 = x0.contiguous() 2025-05-07T20:32:02.7285065Z x1 = x1.contiguous() 2025-05-07T20:32:02.7285304Z 2025-05-07T20:32:02.7285504Z if scale_ub is not None: 2025-05-07T20:32:02.7285784Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:02.7286124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:02.7287866Z ) 2025-05-07T20:32:02.7288110Z else: 2025-05-07T20:32:02.7288318Z scale_ub_tensor = None 2025-05-07T20:32:02.7288571Z 2025-05-07T20:32:02.7288813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:02.7289138Z op = silu_mul_quant 2025-05-07T20:32:02.7289393Z if compiled: 2025-05-07T20:32:02.7289647Z op = torch.compile(op) 2025-05-07T20:32:02.7289949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7290227Z 2025-05-07T20:32:02.7290436Z > y_fp8, y_scale = fn() 2025-05-07T20:32:02.7290603Z 2025-05-07T20:32:02.7290712Z moe/activation_test.py:117: 2025-05-07T20:32:02.7291008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7291344Z moe/activation_test.py:115: in fn 2025-05-07T20:32:02.7291641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:02.7292361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:02.7293065Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:02.7293611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:02.7294306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:02.7294977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:02.7295518Z kernel = self.compile( 2025-05-07T20:32:02.7296066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:02.7296729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.7297125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:02.7297359Z 2025-05-07T20:32:02.7297575Z self = 2025-05-07T20:32:02.7298673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:02.7300045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899015c60>} 2025-05-07T20:32:02.7301386Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:02.7302430Z context = 2025-05-07T20:32:02.7302727Z 2025-05-07T20:32:02.7302896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:02.7303418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.7303886Z module_map=module_map) 2025-05-07T20:32:02.7304258Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.7304708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.7304970Z E ^ 2025-05-07T20:32:02.7305440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:02.7305903Z 2025-05-07T20:32:02.7306326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.7306839Z 2025-05-07T20:32:02.7306958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:02.7307370Z self=, 2025-05-07T20:32:02.7307785Z T=16384, 2025-05-07T20:32:02.7307992Z D=7168, 2025-05-07T20:32:02.7308237Z scale_ub=1200.0, 2025-05-07T20:32:02.7308508Z contiguous=False, 2025-05-07T20:32:02.7308747Z compiled=True, 2025-05-07T20:32:03.0933702Z ) 2025-05-07T20:32:03.0934188Z self = 2025-05-07T20:32:03.0934916Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:03.0935324Z 2025-05-07T20:32:03.0935436Z @given( 2025-05-07T20:32:03.0935760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.0936195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.0936630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.0936962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.0937294Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.0937574Z ) 2025-05-07T20:32:03.0937925Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.0938373Z def test_silu_mul_quant( 2025-05-07T20:32:03.0938620Z self, 2025-05-07T20:32:03.0938815Z T: int, 2025-05-07T20:32:03.0939018Z D: int, 2025-05-07T20:32:03.0939242Z scale_ub: Optional[float], 2025-05-07T20:32:03.0939505Z contiguous: bool, 2025-05-07T20:32:03.0939751Z compiled: bool, 2025-05-07T20:32:03.0939975Z ) -> None: 2025-05-07T20:32:03.0940190Z torch.manual_seed(2025) 2025-05-07T20:32:03.0940438Z 2025-05-07T20:32:03.0940712Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.0941051Z 2025-05-07T20:32:03.0941256Z x_sign = torch.sign(x) 2025-05-07T20:32:03.0941552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.0941865Z x = x_sign * x_clamp 2025-05-07T20:32:03.0942149Z x0 = x[:, :D] 2025-05-07T20:32:03.0942378Z x1 = x[:, D:] 2025-05-07T20:32:03.0942576Z 2025-05-07T20:32:03.0942764Z if contiguous: 2025-05-07T20:32:03.0943008Z x0 = x0.contiguous() 2025-05-07T20:32:03.0943269Z x1 = x1.contiguous() 2025-05-07T20:32:03.0943500Z 2025-05-07T20:32:03.0943691Z if scale_ub is not None: 2025-05-07T20:32:03.0943965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.0944296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.0944609Z ) 2025-05-07T20:32:03.0944802Z else: 2025-05-07T20:32:03.0945004Z scale_ub_tensor = None 2025-05-07T20:32:03.0945251Z 2025-05-07T20:32:03.0945479Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.0945784Z op = silu_mul_quant 2025-05-07T20:32:03.0946032Z if compiled: 2025-05-07T20:32:03.0946278Z op = torch.compile(op) 2025-05-07T20:32:03.0946566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0946837Z 2025-05-07T20:32:03.0947031Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.0947192Z 2025-05-07T20:32:03.0947294Z moe/activation_test.py:117: 2025-05-07T20:32:03.0947585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0947915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.0948192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0948954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.0949609Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.0950263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.0950937Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.0951468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.0952190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.0952846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.0953491Z kernel = self.compile( 2025-05-07T20:32:03.0954035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.0954695Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0955092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0955320Z 2025-05-07T20:32:03.0955524Z self = 2025-05-07T20:32:03.0956602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.0957961Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899408900>} 2025-05-07T20:32:03.0959301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.0960313Z context = 2025-05-07T20:32:03.0960602Z 2025-05-07T20:32:03.0960765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.0961278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0961768Z module_map=module_map) 2025-05-07T20:32:03.0962149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0962498Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0962757Z E ^ 2025-05-07T20:32:03.0963211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.0963668Z 2025-05-07T20:32:03.0964081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.0964589Z 2025-05-07T20:32:03.0964698Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.0965116Z self=, 2025-05-07T20:32:03.0965647Z T=1, 2025-05-07T20:32:03.0965830Z D=7168, 2025-05-07T20:32:03.0966031Z scale_ub=None, 2025-05-07T20:32:03.0966261Z contiguous=False, 2025-05-07T20:32:03.0966484Z compiled=False, 2025-05-07T20:32:03.0966695Z ) 2025-05-07T20:32:03.0967018Z self = 2025-05-07T20:32:03.0967499Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.0967764Z 2025-05-07T20:32:03.0967842Z @given( 2025-05-07T20:32:03.0968078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.0968397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.0968698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.0969037Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.0969476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.0969755Z ) 2025-05-07T20:32:03.0970102Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.0970545Z def test_silu_mul_quant( 2025-05-07T20:32:03.0970782Z self, 2025-05-07T20:32:03.0970979Z T: int, 2025-05-07T20:32:03.0971175Z D: int, 2025-05-07T20:32:03.0971389Z scale_ub: Optional[float], 2025-05-07T20:32:03.0971658Z contiguous: bool, 2025-05-07T20:32:03.0971922Z compiled: bool, 2025-05-07T20:32:03.0972167Z ) -> None: 2025-05-07T20:32:03.0972375Z torch.manual_seed(2025) 2025-05-07T20:32:03.0972616Z 2025-05-07T20:32:03.0972930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.0973301Z 2025-05-07T20:32:03.0973503Z x_sign = torch.sign(x) 2025-05-07T20:32:03.0973792Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.0974091Z x = x_sign * x_clamp 2025-05-07T20:32:03.0974334Z x0 = x[:, :D] 2025-05-07T20:32:03.0974553Z x1 = x[:, D:] 2025-05-07T20:32:03.0974753Z 2025-05-07T20:32:03.0974939Z if contiguous: 2025-05-07T20:32:03.0975170Z x0 = x0.contiguous() 2025-05-07T20:32:03.0975421Z x1 = x1.contiguous() 2025-05-07T20:32:03.0975657Z 2025-05-07T20:32:03.0975847Z if scale_ub is not None: 2025-05-07T20:32:03.0976194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.0976621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.0976931Z ) 2025-05-07T20:32:03.0977117Z else: 2025-05-07T20:32:03.0977328Z scale_ub_tensor = None 2025-05-07T20:32:03.0977582Z 2025-05-07T20:32:03.0977816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.0978128Z op = silu_mul_quant 2025-05-07T20:32:03.0978377Z if compiled: 2025-05-07T20:32:03.0978626Z op = torch.compile(op) 2025-05-07T20:32:03.0978923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0979200Z 2025-05-07T20:32:03.0979399Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.0979562Z 2025-05-07T20:32:03.0979658Z moe/activation_test.py:117: 2025-05-07T20:32:03.0979958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0980290Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.0980566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.0981253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.0981946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.0982477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.0983147Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.0983811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.0984341Z kernel = self.compile( 2025-05-07T20:32:03.0984876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.0985521Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.0985921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.0986147Z 2025-05-07T20:32:03.0986355Z self = 2025-05-07T20:32:03.0987422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.0988879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899409760>} 2025-05-07T20:32:03.0990309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.0991320Z context = 2025-05-07T20:32:03.0991621Z 2025-05-07T20:32:03.0991824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.0992343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.0992856Z module_map=module_map) 2025-05-07T20:32:03.0993260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.0993602Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.0993865Z E ^ 2025-05-07T20:32:03.0994334Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.0994778Z 2025-05-07T20:32:03.0995195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.0995700Z 2025-05-07T20:32:03.0995802Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.0996209Z self=, 2025-05-07T20:32:03.0996609Z T=2048, 2025-05-07T20:32:03.0996797Z D=7168, 2025-05-07T20:32:03.0996982Z scale_ub=None, 2025-05-07T20:32:03.0997200Z contiguous=False, 2025-05-07T20:32:03.0997423Z compiled=True, 2025-05-07T20:32:03.0997624Z ) 2025-05-07T20:32:03.1685064Z self = 2025-05-07T20:32:03.1685732Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.1686034Z 2025-05-07T20:32:03.1686122Z @given( 2025-05-07T20:32:03.1686361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1686676Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1686984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1687318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1687641Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1687931Z ) 2025-05-07T20:32:03.1688280Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1688716Z def test_silu_mul_quant( 2025-05-07T20:32:03.1688966Z self, 2025-05-07T20:32:03.1689166Z T: int, 2025-05-07T20:32:03.1689367Z D: int, 2025-05-07T20:32:03.1689595Z scale_ub: Optional[float], 2025-05-07T20:32:03.1689873Z contiguous: bool, 2025-05-07T20:32:03.1690113Z compiled: bool, 2025-05-07T20:32:03.1690345Z ) -> None: 2025-05-07T20:32:03.1690568Z torch.manual_seed(2025) 2025-05-07T20:32:03.1690810Z 2025-05-07T20:32:03.1691089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1691436Z 2025-05-07T20:32:03.1691635Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1691921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1692237Z x = x_sign * x_clamp 2025-05-07T20:32:03.1692479Z x0 = x[:, :D] 2025-05-07T20:32:03.1692692Z x1 = x[:, D:] 2025-05-07T20:32:03.1692913Z 2025-05-07T20:32:03.1693181Z if contiguous: 2025-05-07T20:32:03.1693471Z x0 = x0.contiguous() 2025-05-07T20:32:03.1693804Z x1 = x1.contiguous() 2025-05-07T20:32:03.1694063Z 2025-05-07T20:32:03.1694259Z if scale_ub is not None: 2025-05-07T20:32:03.1694538Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1694878Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1695180Z ) 2025-05-07T20:32:03.1695380Z else: 2025-05-07T20:32:03.1695768Z scale_ub_tensor = None 2025-05-07T20:32:03.1696018Z 2025-05-07T20:32:03.1696250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1696563Z op = silu_mul_quant 2025-05-07T20:32:03.1696809Z if compiled: 2025-05-07T20:32:03.1697059Z op = torch.compile(op) 2025-05-07T20:32:03.1697357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1697634Z 2025-05-07T20:32:03.1697823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1697990Z 2025-05-07T20:32:03.1698089Z moe/activation_test.py:117: 2025-05-07T20:32:03.1698380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1698822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1699107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1699668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.1700235Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.1700892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.1701584Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.1702165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.1702839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.1703501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.1704035Z kernel = self.compile( 2025-05-07T20:32:03.1704584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.1705242Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.1705648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1705874Z 2025-05-07T20:32:03.1706089Z self = 2025-05-07T20:32:03.1707167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.1708525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489940aa20>} 2025-05-07T20:32:03.1709924Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.1710953Z context = 2025-05-07T20:32:03.1711238Z 2025-05-07T20:32:03.1711417Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.1711930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.1712398Z module_map=module_map) 2025-05-07T20:32:03.1712775Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.1713126Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.1713381Z E ^ 2025-05-07T20:32:03.1713847Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.1714291Z 2025-05-07T20:32:03.1714713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.1715220Z 2025-05-07T20:32:03.1715323Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.1715734Z self=, 2025-05-07T20:32:03.1716247Z T=4096, 2025-05-07T20:32:03.1716438Z D=7168, 2025-05-07T20:32:03.1716628Z scale_ub=None, 2025-05-07T20:32:03.1716848Z contiguous=False, 2025-05-07T20:32:03.1717077Z compiled=True, 2025-05-07T20:32:03.1717277Z ) 2025-05-07T20:32:03.1717598Z self = 2025-05-07T20:32:03.1718088Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.1718358Z 2025-05-07T20:32:03.1718437Z @given( 2025-05-07T20:32:03.1718670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.1718984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.1719328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.1719700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.1720032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.1720313Z ) 2025-05-07T20:32:03.1720661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.1721105Z def test_silu_mul_quant( 2025-05-07T20:32:03.1721347Z self, 2025-05-07T20:32:03.1721539Z T: int, 2025-05-07T20:32:03.1721780Z D: int, 2025-05-07T20:32:03.1722025Z scale_ub: Optional[float], 2025-05-07T20:32:03.1722297Z contiguous: bool, 2025-05-07T20:32:03.1722532Z compiled: bool, 2025-05-07T20:32:03.1722753Z ) -> None: 2025-05-07T20:32:03.1729135Z torch.manual_seed(2025) 2025-05-07T20:32:03.1729400Z 2025-05-07T20:32:03.1729690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.1730030Z 2025-05-07T20:32:03.1730240Z x_sign = torch.sign(x) 2025-05-07T20:32:03.1730541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.1730848Z x = x_sign * x_clamp 2025-05-07T20:32:03.1731095Z x0 = x[:, :D] 2025-05-07T20:32:03.1731318Z x1 = x[:, D:] 2025-05-07T20:32:03.1731527Z 2025-05-07T20:32:03.1731723Z if contiguous: 2025-05-07T20:32:03.1731955Z x0 = x0.contiguous() 2025-05-07T20:32:03.1732211Z x1 = x1.contiguous() 2025-05-07T20:32:03.1732455Z 2025-05-07T20:32:03.1732651Z if scale_ub is not None: 2025-05-07T20:32:03.1732917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.1733261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.1733572Z ) 2025-05-07T20:32:03.1733764Z else: 2025-05-07T20:32:03.1733984Z scale_ub_tensor = None 2025-05-07T20:32:03.1734237Z 2025-05-07T20:32:03.1734469Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.1734783Z op = silu_mul_quant 2025-05-07T20:32:03.1735039Z if compiled: 2025-05-07T20:32:03.1735285Z op = torch.compile(op) 2025-05-07T20:32:03.1735579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1735858Z 2025-05-07T20:32:03.1736060Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.1736221Z 2025-05-07T20:32:03.1736321Z moe/activation_test.py:117: 2025-05-07T20:32:03.1736618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.1736948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.1737222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.1737785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:03.1738348Z return fn(*args, **kwargs) 
2025-05-07T20:32:03.1739007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:03.1739695Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:03.1740229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:03.1741062Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:03.1741719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:03.1742253Z     kernel = self.compile(
2025-05-07T20:32:03.1742792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:03.1743449Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:03.1743839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.1744068Z 
2025-05-07T20:32:03.1744275Z self = 
2025-05-07T20:32:03.1745408Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:03.1746841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489940bce0>}
2025-05-07T20:32:03.1748183Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:03.1749271Z context = 
2025-05-07T20:32:03.1749566Z 
2025-05-07T20:32:03.1749734Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:03.1750262Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:03.1750729Z                            module_map=module_map)
2025-05-07T20:32:03.1751093Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:03.1751449Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:03.1751714Z E       ^
2025-05-07T20:32:03.1752228Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.1753101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:03.3010364Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:03.3010808Z     self=,
2025-05-07T20:32:03.3011324Z     T=16384,
2025-05-07T20:32:03.3011683Z     D=5120,
2025-05-07T20:32:03.3012015Z     scale_ub=1200.0,
2025-05-07T20:32:03.3012547Z     contiguous=False,
2025-05-07T20:32:03.3013232Z     compiled=False,
2025-05-07T20:32:03.3013850Z )
2025-05-07T20:32:03.3014470Z self = 
2025-05-07T20:32:03.3015388Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:03.3015968Z 
2025-05-07T20:32:03.3016118Z     @given(
2025-05-07T20:32:03.3016542Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:03.3017108Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:03.3017669Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:03.3018278Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:03.3018881Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:03.3019400Z     )
2025-05-07T20:32:03.3020038Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:03.3020835Z     def test_silu_mul_quant(
2025-05-07T20:32:03.3021295Z         self,
2025-05-07T20:32:03.3021663Z         T: int,
2025-05-07T20:32:03.3022045Z         D: int,
2025-05-07T20:32:03.3022449Z         scale_ub: Optional[float],
2025-05-07T20:32:03.3022960Z         contiguous: bool,
2025-05-07T20:32:03.3023271Z         compiled: bool,
2025-05-07T20:32:03.3023494Z     ) -> None:
2025-05-07T20:32:03.3023886Z         torch.manual_seed(2025)
2025-05-07T20:32:03.3024148Z 
2025-05-07T20:32:03.3024416Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:03.3024767Z 
2025-05-07T20:32:03.3024968Z         x_sign = torch.sign(x)
2025-05-07T20:32:03.3025258Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:03.3025565Z         x = x_sign * x_clamp
2025-05-07T20:32:03.3025807Z         x0 = x[:, :D]
2025-05-07T20:32:03.3026025Z         x1 = x[:, D:]
2025-05-07T20:32:03.3026233Z 
2025-05-07T20:32:03.3026426Z         if contiguous:
2025-05-07T20:32:03.3026664Z             x0 = x0.contiguous()
2025-05-07T20:32:03.3026986Z             x1 = x1.contiguous()
2025-05-07T20:32:03.3027286Z 
2025-05-07T20:32:03.3027479Z         if scale_ub is not None:
2025-05-07T20:32:03.3027755Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:03.3028095Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:03.3028679Z             )
2025-05-07T20:32:03.3028868Z         else:
2025-05-07T20:32:03.3029137Z             scale_ub_tensor = None
2025-05-07T20:32:03.3029389Z 
2025-05-07T20:32:03.3029617Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:03.3029931Z             op = silu_mul_quant
2025-05-07T20:32:03.3030180Z             if compiled:
2025-05-07T20:32:03.3030421Z                 op = torch.compile(op)
2025-05-07T20:32:03.3030715Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:03.3030983Z 
2025-05-07T20:32:03.3031178Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:03.3031341Z 
2025-05-07T20:32:03.3031452Z moe/activation_test.py:117: 
2025-05-07T20:32:03.3031773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.3032134Z moe/activation_test.py:115: in fn
2025-05-07T20:32:03.3032412Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:03.3033097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:03.3033786Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:03.3034318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:03.3034999Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:03.3035650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:03.3036187Z     kernel = self.compile(
2025-05-07T20:32:03.3036724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:03.3037375Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:03.3037770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:03.3038009Z 
2025-05-07T20:32:03.3038221Z self = 
2025-05-07T20:32:03.3039298Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:03.3040653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b24c20>}
2025-05-07T20:32:03.3041988Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:03.3043020Z context = 
2025-05-07T20:32:03.3043310Z 
2025-05-07T20:32:03.3043475Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:03.3044132Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:03.3044597Z                            module_map=module_map)
2025-05-07T20:32:03.3044961Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:03.3045313Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:03.3045565Z E       ^
2025-05-07T20:32:03.3046027Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.3046893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
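The compile-time failure above is identical for every Hypothesis example that follows: Triton rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant and reports that only 'fp8e4b15' and 'fp8e5' are available. A plausible reading, stated here as an assumption rather than a fact from the log: the g5.4xlarge runner carries an NVIDIA A10G, an Ampere-class GPU (compute capability 8.6), while Triton only lowers fp8e4nv on newer architectures (roughly SM 8.9 and up). Below is a minimal sketch of a capability guard that would let such tests skip cleanly on this hardware; the helper name and the (8, 9) threshold are assumptions to verify against the Triton version in use, not part of the FBGEMM test suite.

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9
        # (Ada/Hopper). The A10G in a g5.4xlarge reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical wrapper showing where the guard would attach; the real
    # test class in moe/activation_test.py is not reproduced here.
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTestsGuarded(unittest.TestCase):
        ...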
2025-05-07T20:32:03.3047651Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.3078654Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.4385557Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:03.4419597Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.4421090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:03.6947721Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.6949284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:03.6979917Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.8298031Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.8331312Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.8332831Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:03.9190150Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:03.9191632Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:03.9221892Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.0178425Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:04.0216560Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.0218142Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:04.0247491Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.3397858Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:04.3429311Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:04.3430791Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:04.4182460Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4182987Z 2025-05-07T20:32:04.4183403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.4183908Z 2025-05-07T20:32:04.4184012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4184420Z self=, 2025-05-07T20:32:04.4184821Z T=2048, 2025-05-07T20:32:04.4185006Z D=7168, 2025-05-07T20:32:04.4185199Z scale_ub=None, 2025-05-07T20:32:04.4185434Z contiguous=True, 2025-05-07T20:32:04.4185655Z compiled=True, 2025-05-07T20:32:04.4185855Z ) 2025-05-07T20:32:04.4186171Z self = 2025-05-07T20:32:04.4186735Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.4186995Z 2025-05-07T20:32:04.4187078Z @given( 2025-05-07T20:32:04.4187302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4187616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4187919Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4188240Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4188562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4188844Z ) 2025-05-07T20:32:04.4189259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4189698Z def test_silu_mul_quant( 2025-05-07T20:32:04.4189939Z self, 2025-05-07T20:32:04.4190129Z T: int, 2025-05-07T20:32:04.4190324Z D: int, 2025-05-07T20:32:04.4190541Z scale_ub: Optional[float], 2025-05-07T20:32:04.4190808Z contiguous: bool, 2025-05-07T20:32:04.4191049Z compiled: bool, 2025-05-07T20:32:04.4191270Z ) -> None: 2025-05-07T20:32:04.4191485Z torch.manual_seed(2025) 2025-05-07T20:32:04.4191718Z 2025-05-07T20:32:04.4191991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4192340Z 2025-05-07T20:32:04.4192567Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4192863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4193166Z x = x_sign * x_clamp 2025-05-07T20:32:04.4193405Z x0 = x[:, :D] 2025-05-07T20:32:04.4193617Z x1 = x[:, D:] 2025-05-07T20:32:04.4193820Z 2025-05-07T20:32:04.4193998Z if contiguous: 2025-05-07T20:32:04.4194232Z x0 = x0.contiguous() 2025-05-07T20:32:04.4194488Z x1 = x1.contiguous() 2025-05-07T20:32:04.4194721Z 2025-05-07T20:32:04.4194916Z if scale_ub is not None: 2025-05-07T20:32:04.4195188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.4195521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.4195824Z ) 2025-05-07T20:32:04.4196017Z else: 2025-05-07T20:32:04.4196225Z scale_ub_tensor = None 2025-05-07T20:32:04.4196477Z 2025-05-07T20:32:04.4196721Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.4197043Z op = silu_mul_quant 2025-05-07T20:32:04.4197288Z if compiled: 2025-05-07T20:32:04.4197535Z op = torch.compile(op) 2025-05-07T20:32:04.4197836Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.4198105Z 2025-05-07T20:32:04.4198297Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.4198458Z 2025-05-07T20:32:04.4198560Z moe/activation_test.py:117: 2025-05-07T20:32:04.4198861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.4199200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.4205165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.4205748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:04.4206309Z return fn(*args, **kwargs) 
2025-05-07T20:32:04.4207104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.4207804Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.4208344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.4209025Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.4209703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.4210236Z kernel = self.compile( 2025-05-07T20:32:04.4210779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.4211525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.4211943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.4212214Z 2025-05-07T20:32:04.4212435Z self = 2025-05-07T20:32:04.4213523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.4214905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48986f1940>} 2025-05-07T20:32:04.4216264Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.4217306Z context = 2025-05-07T20:32:04.4217595Z 2025-05-07T20:32:04.4217774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.4218291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.4218762Z module_map=module_map) 2025-05-07T20:32:04.4219137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.4219484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.4219739Z E ^ 2025-05-07T20:32:04.4220204Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.4220656Z 2025-05-07T20:32:04.4221083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.4221603Z 2025-05-07T20:32:04.4891977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4892589Z self=, 2025-05-07T20:32:04.4893208Z T=16384, 2025-05-07T20:32:04.4893493Z D=5120, 2025-05-07T20:32:04.4893769Z scale_ub=None, 2025-05-07T20:32:04.4894057Z contiguous=False, 2025-05-07T20:32:04.4894373Z compiled=False, 2025-05-07T20:32:04.4894642Z ) 2025-05-07T20:32:04.4895075Z self = 2025-05-07T20:32:04.4895672Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.4895949Z 2025-05-07T20:32:04.4896030Z @given( 2025-05-07T20:32:04.4896257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4896564Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4896869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4897193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4897519Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4897803Z ) 2025-05-07T20:32:04.4898144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4898755Z def test_silu_mul_quant( 2025-05-07T20:32:04.4898998Z self, 2025-05-07T20:32:04.4899186Z T: int, 2025-05-07T20:32:04.4899389Z D: int, 2025-05-07T20:32:04.4899605Z scale_ub: Optional[float], 2025-05-07T20:32:04.4899866Z contiguous: bool, 2025-05-07T20:32:04.4900105Z compiled: bool, 2025-05-07T20:32:04.4900327Z ) -> None: 2025-05-07T20:32:04.4900541Z torch.manual_seed(2025) 2025-05-07T20:32:04.4900778Z 2025-05-07T20:32:04.4901049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4901387Z 2025-05-07T20:32:04.4901584Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4901873Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4904013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
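Every CompilationError in this run has the same root cause: the job runs on a g5 (A10G) instance, which is compute capability sm_86, while Triton's fp8e4nv (e4m3) type is only supported on sm_89 and newer; on this architecture only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard, assuming only public torch APIs (the 8.9 threshold is inferred from the error above, not taken from the test code):

```python
# Sketch, not part of activation_test.py: gate fp8e4nv usage on compute
# capability. The (8, 9) threshold is an assumption inferred from this log:
# the A10G (sm_86) rejects fp8e4nv, while Ada (sm_89) and Hopper (sm_90)
# accept it.
import torch

def device_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)
```

Skipping the test (or falling back to a supported fp8 dtype) on this predicate would turn these hard compile failures into skips on pre-Ada runners.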
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4905892Z 2025-05-07T20:32:04.4906012Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4906220Z 2025-05-07T20:32:04.4906325Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4906722Z self=, 2025-05-07T20:32:04.4907115Z T=4096, 2025-05-07T20:32:04.4907312Z D=7168, 2025-05-07T20:32:04.4907507Z scale_ub=1200.0, 2025-05-07T20:32:04.4907735Z contiguous=True, 2025-05-07T20:32:04.4907956Z compiled=True, 2025-05-07T20:32:04.4908154Z ) 2025-05-07T20:32:04.4908471Z self = 2025-05-07T20:32:04.4908968Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:04.4909325Z 2025-05-07T20:32:04.4909409Z @given( 2025-05-07T20:32:04.4909630Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4909936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4910244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4910565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4910887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4911175Z ) 2025-05-07T20:32:04.4911523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4911962Z def test_silu_mul_quant( 2025-05-07T20:32:04.4912238Z self, 2025-05-07T20:32:04.4912451Z T: int, 2025-05-07T20:32:04.4912649Z D: int, 2025-05-07T20:32:04.4912870Z scale_ub: Optional[float], 2025-05-07T20:32:04.4913142Z contiguous: bool, 2025-05-07T20:32:04.4913385Z compiled: bool, 2025-05-07T20:32:04.4913609Z ) -> None: 2025-05-07T20:32:04.4913828Z torch.manual_seed(2025) 2025-05-07T20:32:04.4914062Z 2025-05-07T20:32:04.4914337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4914674Z 2025-05-07T20:32:04.4914865Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4915159Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4917235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4919088Z 2025-05-07T20:32:04.4919207Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4919418Z 2025-05-07T20:32:04.4919525Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4919931Z self=, 2025-05-07T20:32:04.4920328Z T=16384, 2025-05-07T20:32:04.4920528Z D=7168, 2025-05-07T20:32:04.4920721Z scale_ub=None, 2025-05-07T20:32:04.4920940Z contiguous=False, 2025-05-07T20:32:04.4921168Z compiled=False, 2025-05-07T20:32:04.4921367Z ) 2025-05-07T20:32:04.4921688Z self = 2025-05-07T20:32:04.4922246Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.4922601Z 2025-05-07T20:32:04.4922687Z @given( 2025-05-07T20:32:04.4922916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4923223Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4923533Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4923858Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4924182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4924467Z ) 2025-05-07T20:32:04.4924812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4925245Z def test_silu_mul_quant( 2025-05-07T20:32:04.4925485Z self, 2025-05-07T20:32:04.4925675Z T: int, 2025-05-07T20:32:04.4925875Z D: int, 2025-05-07T20:32:04.4926090Z scale_ub: Optional[float], 2025-05-07T20:32:04.4926359Z contiguous: bool, 2025-05-07T20:32:04.4926596Z compiled: bool, 2025-05-07T20:32:04.4926816Z ) -> None: 2025-05-07T20:32:04.4927024Z torch.manual_seed(2025) 2025-05-07T20:32:04.4927263Z 2025-05-07T20:32:04.4927533Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4929847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
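The allocation sizes in these OOM messages line up exactly with the test's input tensor: x has shape [T, 2 * D] in bfloat16, i.e. 2 bytes per element. For the T=16384, D=7168 example above, that is a single 448 MiB request, which is precisely what the allocator reports:

```python
# Cross-checking the reported allocation against the test's tensor shape:
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes.
T, D = 16384, 7168
size_bytes = T * (2 * D) * 2       # bfloat16 is 2 bytes per element
print(size_bytes / (1024 ** 2))    # 448.0 -> the 448.00 MiB in the OOM above
```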
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4931716Z 2025-05-07T20:32:04.4931836Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.4932054Z 2025-05-07T20:32:04.4932158Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4932562Z self=, 2025-05-07T20:32:04.4932961Z T=2048, 2025-05-07T20:32:04.4933149Z D=7168, 2025-05-07T20:32:04.4933340Z scale_ub=1200.0, 2025-05-07T20:32:04.4933567Z contiguous=True, 2025-05-07T20:32:04.4933782Z compiled=True, 2025-05-07T20:32:04.4933987Z ) 2025-05-07T20:32:04.4934307Z self = 2025-05-07T20:32:04.4934790Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:04.4935061Z 2025-05-07T20:32:04.4935140Z @given( 2025-05-07T20:32:04.4935365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.4935678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.4935978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.4936301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.4936631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.4936905Z ) 2025-05-07T20:32:04.4937250Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.4937686Z def test_silu_mul_quant( 2025-05-07T20:32:04.4938059Z self, 2025-05-07T20:32:04.4938250Z T: int, 2025-05-07T20:32:04.4938445Z D: int, 2025-05-07T20:32:04.4938655Z scale_ub: Optional[float], 2025-05-07T20:32:04.4938925Z contiguous: bool, 2025-05-07T20:32:04.4939161Z compiled: bool, 2025-05-07T20:32:04.4939376Z ) -> None: 2025-05-07T20:32:04.4939589Z torch.manual_seed(2025) 2025-05-07T20:32:04.4939825Z 2025-05-07T20:32:04.4940090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.4940430Z 2025-05-07T20:32:04.4940622Z x_sign = torch.sign(x) 2025-05-07T20:32:04.4940907Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.4943032Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.4944948Z 2025-05-07T20:32:04.4945070Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:04.4945292Z 2025-05-07T20:32:04.4945397Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.4945809Z self=, 2025-05-07T20:32:04.4946203Z T=2048, 2025-05-07T20:32:04.4946389Z D=7168, 2025-05-07T20:32:04.4946580Z scale_ub=None, 2025-05-07T20:32:04.4946794Z contiguous=True, 2025-05-07T20:32:04.4947013Z compiled=False, 2025-05-07T20:32:04.4947217Z ) 2025-05-07T20:32:04.5814594Z self = 2025-05-07T20:32:04.5815299Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.5815688Z 2025-05-07T20:32:04.5815826Z @given( 2025-05-07T20:32:04.5816139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.5816570Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.5816873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.5817202Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.5817526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.5817801Z ) 2025-05-07T20:32:04.5818146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.5818579Z def test_silu_mul_quant( 2025-05-07T20:32:04.5818820Z self, 2025-05-07T20:32:04.5819012Z T: int, 2025-05-07T20:32:04.5819211Z D: int, 2025-05-07T20:32:04.5819431Z scale_ub: Optional[float], 2025-05-07T20:32:04.5819696Z contiguous: bool, 2025-05-07T20:32:04.5819934Z compiled: bool, 2025-05-07T20:32:04.5820160Z ) -> None: 2025-05-07T20:32:04.5820380Z torch.manual_seed(2025) 2025-05-07T20:32:04.5820613Z 2025-05-07T20:32:04.5820879Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.5821213Z 2025-05-07T20:32:04.5821409Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.5823714Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
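Note that the failing requests here are small (40 to 448 MiB) while roughly 21.7 GiB of the A10G's 22.07 GiB is already held by PyTorch, so the problem is accumulation across Hypothesis examples rather than any single tensor. A sketch of one mitigation, assuming a hook can run between examples (this is not the suite's actual fixture; PYTORCH_CUDA_ALLOC_CONF is the allocator hint quoted in the messages themselves):

```python
# Sketch, not the suite's actual teardown: release cached CUDA blocks
# between Hypothesis examples so earlier inputs do not starve later ones.
import gc
import os

# The allocator hint from the OOM messages; it must be set before the
# first CUDA allocation in the process to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_memory() -> None:
    gc.collect()                # drop dead Python references first
    torch.cuda.empty_cache()    # return cached blocks to the CUDA driver
    torch.cuda.synchronize()    # ensure pending frees have completed
```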
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.5825640Z 2025-05-07T20:32:04.5825948Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:04.5826165Z 2025-05-07T20:32:04.5826266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.5826675Z self=, 2025-05-07T20:32:04.5827068Z T=1, 2025-05-07T20:32:04.5827256Z D=7168, 2025-05-07T20:32:04.5827445Z scale_ub=1200.0, 2025-05-07T20:32:04.5827658Z contiguous=True, 2025-05-07T20:32:04.5827880Z compiled=False, 2025-05-07T20:32:04.5828081Z ) 2025-05-07T20:32:04.5828594Z self = 2025-05-07T20:32:04.5829115Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.5829445Z 2025-05-07T20:32:04.5829529Z @given( 2025-05-07T20:32:04.5829815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.5830125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.5830428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.5830763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.5831081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.5831361Z ) 2025-05-07T20:32:04.5831754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.5832290Z def test_silu_mul_quant( 2025-05-07T20:32:04.5832587Z self, 2025-05-07T20:32:04.5832827Z T: int, 2025-05-07T20:32:04.5833064Z D: int, 2025-05-07T20:32:04.5833336Z scale_ub: Optional[float], 2025-05-07T20:32:04.5833670Z contiguous: bool, 2025-05-07T20:32:04.5833918Z compiled: bool, 2025-05-07T20:32:04.5834134Z ) -> None: 2025-05-07T20:32:04.5834349Z torch.manual_seed(2025) 2025-05-07T20:32:04.5834588Z 2025-05-07T20:32:04.5834861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.5835195Z 2025-05-07T20:32:04.5835391Z x_sign = torch.sign(x) 2025-05-07T20:32:04.5835678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.5835982Z x = x_sign * x_clamp 2025-05-07T20:32:04.5836222Z x0 = x[:, :D] 2025-05-07T20:32:04.5836431Z x1 = x[:, D:] 2025-05-07T20:32:04.5836632Z 2025-05-07T20:32:04.5836814Z if contiguous: 2025-05-07T20:32:04.5837039Z x0 = x0.contiguous() 2025-05-07T20:32:04.5837298Z x1 = x1.contiguous() 2025-05-07T20:32:04.5837533Z 2025-05-07T20:32:04.5837717Z if scale_ub is not None: 2025-05-07T20:32:04.5837982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.5838311Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.5838609Z ) 2025-05-07T20:32:04.5838802Z else: 2025-05-07T20:32:04.5839012Z scale_ub_tensor = None 2025-05-07T20:32:04.5839256Z 2025-05-07T20:32:04.5839484Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.5839794Z op = silu_mul_quant 2025-05-07T20:32:04.5840040Z if compiled: 2025-05-07T20:32:04.5840284Z op = torch.compile(op) 2025-05-07T20:32:04.5840588Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.5840855Z 2025-05-07T20:32:04.5841041Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.5841208Z 2025-05-07T20:32:04.5841305Z moe/activation_test.py:117: 2025-05-07T20:32:04.5841602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.5841926Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.5842206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.5842895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.5843584Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.5844108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.5844926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.5845587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.5846108Z kernel = self.compile( 2025-05-07T20:32:04.5846641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.5847294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.5847686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.5847911Z 2025-05-07T20:32:04.5848115Z self = 2025-05-07T20:32:04.5849267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.5850622Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898219300>} 2025-05-07T20:32:04.5852024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.5853280Z context = 2025-05-07T20:32:04.5853635Z 2025-05-07T20:32:04.5853840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.5854369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.5854837Z module_map=module_map) 2025-05-07T20:32:04.5855193Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.5855545Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.5855805Z E ^ 2025-05-07T20:32:04.5856258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.5856706Z 2025-05-07T20:32:04.5857117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.5857625Z 2025-05-07T20:32:04.5857727Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.5858132Z self=, 2025-05-07T20:32:04.5858521Z T=128, 2025-05-07T20:32:04.5858705Z D=5120, 2025-05-07T20:32:04.5858895Z scale_ub=None, 2025-05-07T20:32:04.5859108Z contiguous=True, 2025-05-07T20:32:04.5859332Z compiled=False, 2025-05-07T20:32:04.5859544Z ) 2025-05-07T20:32:04.6403809Z self = 2025-05-07T20:32:04.6404498Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.6404884Z 2025-05-07T20:32:04.6404990Z @given( 2025-05-07T20:32:04.6405308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6405733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6406122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6406447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6406766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6407044Z ) 2025-05-07T20:32:04.6407389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6407818Z def test_silu_mul_quant( 2025-05-07T20:32:04.6408059Z self, 2025-05-07T20:32:04.6408247Z T: int, 2025-05-07T20:32:04.6408446Z D: int, 2025-05-07T20:32:04.6408661Z scale_ub: Optional[float], 2025-05-07T20:32:04.6408922Z contiguous: bool, 2025-05-07T20:32:04.6409157Z compiled: bool, 2025-05-07T20:32:04.6409381Z ) -> None: 2025-05-07T20:32:04.6409750Z torch.manual_seed(2025) 2025-05-07T20:32:04.6409992Z 2025-05-07T20:32:04.6410260Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6410590Z 2025-05-07T20:32:04.6410783Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6411071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6411373Z x = x_sign * x_clamp 2025-05-07T20:32:04.6411615Z x0 = x[:, :D] 2025-05-07T20:32:04.6411831Z x1 = x[:, D:] 2025-05-07T20:32:04.6412030Z 2025-05-07T20:32:04.6412215Z if contiguous: 2025-05-07T20:32:04.6412472Z x0 = x0.contiguous() 2025-05-07T20:32:04.6412811Z x1 = x1.contiguous() 2025-05-07T20:32:04.6413095Z 2025-05-07T20:32:04.6413283Z if scale_ub is not None: 2025-05-07T20:32:04.6413551Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6413876Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6414185Z ) 2025-05-07T20:32:04.6414378Z else: 2025-05-07T20:32:04.6414584Z scale_ub_tensor = None 2025-05-07T20:32:04.6414832Z 2025-05-07T20:32:04.6415059Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6415365Z op = silu_mul_quant 2025-05-07T20:32:04.6415614Z if compiled: 2025-05-07T20:32:04.6415857Z op = torch.compile(op) 2025-05-07T20:32:04.6416146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6416418Z 2025-05-07T20:32:04.6416611Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.6416773Z 2025-05-07T20:32:04.6416874Z moe/activation_test.py:117: 2025-05-07T20:32:04.6417160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6417498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.6417774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6418460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.6419142Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.6419670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6420340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6420990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6421514Z kernel = self.compile( 2025-05-07T20:32:04.6422047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6422699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6423096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6423324Z 2025-05-07T20:32:04.6423530Z self = 2025-05-07T20:32:04.6424595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6425946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489821a520>} 2025-05-07T20:32:04.6427276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6428484Z context = 2025-05-07T20:32:04.6428772Z 2025-05-07T20:32:04.6428940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6429610Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6430069Z module_map=module_map) 2025-05-07T20:32:04.6430429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6430779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.6431032Z E ^ 2025-05-07T20:32:04.6431489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6431934Z 2025-05-07T20:32:04.6432401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6432964Z 2025-05-07T20:32:04.6433150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6433557Z self=, 2025-05-07T20:32:04.6433950Z T=128, 2025-05-07T20:32:04.6434144Z D=7168, 2025-05-07T20:32:04.6434331Z scale_ub=None, 2025-05-07T20:32:04.6434576Z contiguous=True, 2025-05-07T20:32:04.6434798Z compiled=False, 2025-05-07T20:32:04.6435001Z ) 2025-05-07T20:32:04.6435312Z self = 2025-05-07T20:32:04.6435792Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.6436053Z 2025-05-07T20:32:04.6436138Z @given( 2025-05-07T20:32:04.6436364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.6436677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.6436982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.6437305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.6437629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.6437912Z ) 2025-05-07T20:32:04.6438261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.6438690Z def test_silu_mul_quant( 2025-05-07T20:32:04.6438934Z self, 2025-05-07T20:32:04.6439126Z T: int, 2025-05-07T20:32:04.6439319Z D: int, 2025-05-07T20:32:04.6445658Z scale_ub: Optional[float], 2025-05-07T20:32:04.6445936Z contiguous: bool, 2025-05-07T20:32:04.6446171Z compiled: bool, 2025-05-07T20:32:04.6446389Z ) -> None: 2025-05-07T20:32:04.6446600Z torch.manual_seed(2025) 2025-05-07T20:32:04.6446836Z 2025-05-07T20:32:04.6447106Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.6447445Z 2025-05-07T20:32:04.6447637Z x_sign = torch.sign(x) 2025-05-07T20:32:04.6447920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.6448236Z x = x_sign * x_clamp 2025-05-07T20:32:04.6448477Z x0 = x[:, :D] 2025-05-07T20:32:04.6448685Z x1 = x[:, D:] 2025-05-07T20:32:04.6448895Z 2025-05-07T20:32:04.6449078Z if contiguous: 2025-05-07T20:32:04.6449300Z x0 = x0.contiguous() 2025-05-07T20:32:04.6449560Z x1 = x1.contiguous() 2025-05-07T20:32:04.6449794Z 2025-05-07T20:32:04.6449978Z if scale_ub is not None: 2025-05-07T20:32:04.6450241Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6450572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6450869Z ) 2025-05-07T20:32:04.6451061Z else: 2025-05-07T20:32:04.6451277Z scale_ub_tensor = None 2025-05-07T20:32:04.6451527Z 2025-05-07T20:32:04.6451753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6452082Z op = silu_mul_quant 2025-05-07T20:32:04.6452367Z if compiled: 2025-05-07T20:32:04.6452607Z op = torch.compile(op) 2025-05-07T20:32:04.6452907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6453176Z 2025-05-07T20:32:04.6453364Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.6453530Z 2025-05-07T20:32:04.6453630Z moe/activation_test.py:117: 2025-05-07T20:32:04.6454034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6454357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.6454631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6455316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.6456000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.6456522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6457198Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6457930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6458447Z kernel = self.compile( 2025-05-07T20:32:04.6458986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6459632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6460026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6460248Z 2025-05-07T20:32:04.6460451Z self = 2025-05-07T20:32:04.6461518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6462922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489821b560>} 2025-05-07T20:32:04.6464260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6465272Z context = 2025-05-07T20:32:04.6465555Z 2025-05-07T20:32:04.6465718Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6466224Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6466680Z module_map=module_map) 2025-05-07T20:32:04.6467034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6467381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.6467632Z E ^ 2025-05-07T20:32:04.6468091Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6468535Z 2025-05-07T20:32:04.6468951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6469531Z 2025-05-07T20:32:04.6469633Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6470035Z self=, 2025-05-07T20:32:04.6470430Z T=2048, 2025-05-07T20:32:04.6470614Z D=7168, 2025-05-07T20:32:04.6470813Z scale_ub=1200.0, 2025-05-07T20:32:04.6471034Z contiguous=True, 2025-05-07T20:32:04.6471247Z compiled=False, 2025-05-07T20:32:04.6471453Z ) 2025-05-07T20:32:04.7130713Z self = 2025-05-07T20:32:04.7131350Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.7131726Z 2025-05-07T20:32:04.7131839Z @given( 2025-05-07T20:32:04.7132207Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7132679Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7133106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7133692Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7134026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7134301Z ) 2025-05-07T20:32:04.7134648Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7135085Z def test_silu_mul_quant( 2025-05-07T20:32:04.7135322Z self, 2025-05-07T20:32:04.7135521Z T: int, 2025-05-07T20:32:04.7135720Z D: int, 2025-05-07T20:32:04.7135938Z scale_ub: Optional[float], 2025-05-07T20:32:04.7136202Z contiguous: bool, 2025-05-07T20:32:04.7136439Z compiled: bool, 2025-05-07T20:32:04.7136664Z ) -> None: 2025-05-07T20:32:04.7136938Z torch.manual_seed(2025) 2025-05-07T20:32:04.7137233Z 2025-05-07T20:32:04.7137504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7139553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7141396Z 2025-05-07T20:32:04.7141515Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7141730Z 2025-05-07T20:32:04.7141833Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7142246Z self=, 2025-05-07T20:32:04.7142647Z T=1, 2025-05-07T20:32:04.7142833Z D=5120, 2025-05-07T20:32:04.7143024Z scale_ub=1200.0, 2025-05-07T20:32:04.7143251Z contiguous=True, 2025-05-07T20:32:04.7143467Z compiled=False, 2025-05-07T20:32:04.7143697Z ) 2025-05-07T20:32:04.7144016Z self = 2025-05-07T20:32:04.7144499Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.7144763Z 2025-05-07T20:32:04.7144843Z @given( 2025-05-07T20:32:04.7145073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7145380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7145675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7146004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7146330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7146612Z ) 2025-05-07T20:32:04.7146955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7147390Z def test_silu_mul_quant( 2025-05-07T20:32:04.7147626Z self, 2025-05-07T20:32:04.7147813Z T: int, 2025-05-07T20:32:04.7148007Z D: int, 2025-05-07T20:32:04.7148229Z scale_ub: Optional[float], 2025-05-07T20:32:04.7148501Z contiguous: bool, 2025-05-07T20:32:04.7148739Z compiled: bool, 2025-05-07T20:32:04.7148956Z ) -> None: 2025-05-07T20:32:04.7149241Z torch.manual_seed(2025) 2025-05-07T20:32:04.7149478Z 2025-05-07T20:32:04.7149742Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7150077Z 2025-05-07T20:32:04.7150269Z x_sign = torch.sign(x) 2025-05-07T20:32:04.7150555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.7150856Z x = x_sign * x_clamp 2025-05-07T20:32:04.7151096Z x0 = x[:, :D] 2025-05-07T20:32:04.7151312Z x1 = x[:, D:] 2025-05-07T20:32:04.7151519Z 2025-05-07T20:32:04.7151736Z if contiguous: 2025-05-07T20:32:04.7152021Z x0 = x0.contiguous() 2025-05-07T20:32:04.7152338Z x1 = x1.contiguous() 2025-05-07T20:32:04.7152639Z 2025-05-07T20:32:04.7152980Z if scale_ub is not None: 2025-05-07T20:32:04.7153319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.7153736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.7154114Z ) 2025-05-07T20:32:04.7154358Z else: 2025-05-07T20:32:04.7154576Z scale_ub_tensor = None 2025-05-07T20:32:04.7154820Z 2025-05-07T20:32:04.7155045Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.7155353Z op = silu_mul_quant 2025-05-07T20:32:04.7155597Z if compiled: 2025-05-07T20:32:04.7155840Z op = torch.compile(op) 2025-05-07T20:32:04.7156130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.7156480Z 2025-05-07T20:32:04.7156665Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.7156828Z 2025-05-07T20:32:04.7156927Z moe/activation_test.py:117: 2025-05-07T20:32:04.7157216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.7157545Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.7157821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.7158502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.7159182Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.7159707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.7160385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.7161039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.7161569Z kernel = self.compile( 2025-05-07T20:32:04.7162211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.7163032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.7163524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.7163795Z 2025-05-07T20:32:04.7164001Z self = 2025-05-07T20:32:04.7165070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.7166415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48980e8a40>} 2025-05-07T20:32:04.7167748Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.7168762Z context = 2025-05-07T20:32:04.7169046Z 2025-05-07T20:32:04.7169212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.7169728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.7170193Z module_map=module_map) 2025-05-07T20:32:04.7170558Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.7170913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.7171174Z E ^ 2025-05-07T20:32:04.7171633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.7172086Z 2025-05-07T20:32:04.7172500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.7173013Z 2025-05-07T20:32:04.7173115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7173604Z self=, 2025-05-07T20:32:04.7174002Z T=2048, 2025-05-07T20:32:04.7174183Z D=5120, 2025-05-07T20:32:04.7174375Z scale_ub=None, 2025-05-07T20:32:04.7174585Z contiguous=True, 2025-05-07T20:32:04.7174808Z compiled=False, 2025-05-07T20:32:04.7175010Z ) 2025-05-07T20:32:04.7175324Z self = 2025-05-07T20:32:04.7175822Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7176088Z 2025-05-07T20:32:04.7176164Z @given( 2025-05-07T20:32:04.7176392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7176769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7177106Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7177434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7177758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7178042Z ) 2025-05-07T20:32:04.7178389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7178823Z def test_silu_mul_quant( 2025-05-07T20:32:04.7179058Z self, 2025-05-07T20:32:04.7179243Z T: int, 2025-05-07T20:32:04.7179439Z D: int, 2025-05-07T20:32:04.7179657Z scale_ub: Optional[float], 2025-05-07T20:32:04.7179921Z contiguous: bool, 2025-05-07T20:32:04.7180159Z compiled: bool, 2025-05-07T20:32:04.7180376Z ) -> None: 2025-05-07T20:32:04.7180583Z torch.manual_seed(2025) 2025-05-07T20:32:04.7180819Z 2025-05-07T20:32:04.7181089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7181429Z 2025-05-07T20:32:04.7181623Z > x_sign = torch.sign(x) 2025-05-07T20:32:04.7183997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7185912Z 2025-05-07T20:32:04.7186030Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:04.7186246Z 2025-05-07T20:32:04.7186357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7186759Z self=, 2025-05-07T20:32:04.7187166Z T=16384, 2025-05-07T20:32:04.7187355Z D=5120, 2025-05-07T20:32:04.7187540Z scale_ub=None, 2025-05-07T20:32:04.7187748Z contiguous=True, 2025-05-07T20:32:04.7187970Z compiled=False, 2025-05-07T20:32:04.7188165Z ) 2025-05-07T20:32:04.7906498Z self = 2025-05-07T20:32:04.7907856Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7908569Z 2025-05-07T20:32:04.7908771Z @given( 2025-05-07T20:32:04.7909458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7910119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7910677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7911268Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7911879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7912493Z ) 2025-05-07T20:32:04.7913259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7914003Z def test_silu_mul_quant( 2025-05-07T20:32:04.7914279Z self, 2025-05-07T20:32:04.7914471Z T: int, 2025-05-07T20:32:04.7914668Z D: int, 2025-05-07T20:32:04.7915054Z scale_ub: Optional[float], 2025-05-07T20:32:04.7915328Z contiguous: bool, 2025-05-07T20:32:04.7915575Z compiled: bool, 2025-05-07T20:32:04.7915800Z ) -> None: 2025-05-07T20:32:04.7916093Z torch.manual_seed(2025) 2025-05-07T20:32:04.7916425Z 2025-05-07T20:32:04.7916773Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7918826Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7920830Z 2025-05-07T20:32:04.7920961Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7921177Z 2025-05-07T20:32:04.7921279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7921687Z self=, 2025-05-07T20:32:04.7922096Z T=4096, 2025-05-07T20:32:04.7922284Z D=5120, 2025-05-07T20:32:04.7922476Z scale_ub=None, 2025-05-07T20:32:04.7922695Z contiguous=True, 2025-05-07T20:32:04.7922912Z compiled=False, 2025-05-07T20:32:04.7923115Z ) 2025-05-07T20:32:04.7923437Z self = 2025-05-07T20:32:04.7923923Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.7924195Z 2025-05-07T20:32:04.7924273Z @given( 2025-05-07T20:32:04.7924501Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7924811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7925108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7925441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7925767Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7926043Z ) 2025-05-07T20:32:04.7926399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7926844Z def test_silu_mul_quant( 2025-05-07T20:32:04.7927087Z self, 2025-05-07T20:32:04.7927279Z T: int, 2025-05-07T20:32:04.7927476Z D: int, 2025-05-07T20:32:04.7927694Z scale_ub: Optional[float], 2025-05-07T20:32:04.7927960Z contiguous: bool, 2025-05-07T20:32:04.7928426Z compiled: bool, 2025-05-07T20:32:04.7928657Z ) -> None: 2025-05-07T20:32:04.7928869Z torch.manual_seed(2025) 2025-05-07T20:32:04.7929108Z 2025-05-07T20:32:04.7929376Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7931391Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
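Since Hypothesis draws these parameter sets from st.sampled_from, any failing combination from this log can be pinned so it always re-runs deterministically instead of depending on the search order. A hypothetical, self-contained sketch (test_pinned and the single-parameter strategy are illustrative, not from activation_test.py):

```python
# Sketch: pin a failing parameter set from this log with Hypothesis's
# @example decorator so it is exercised on every run.
from hypothesis import example, given, settings
from hypothesis import strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=16384)  # the size that hit the 320.00 MiB OOM above
@settings(deadline=None)
def test_pinned(T: int) -> None:
    assert T >= 1
```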
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7933281Z 2025-05-07T20:32:04.7933399Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7933611Z 2025-05-07T20:32:04.7933714Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7934127Z self=, 2025-05-07T20:32:04.7934531Z T=2048, 2025-05-07T20:32:04.7934716Z D=5120, 2025-05-07T20:32:04.7934908Z scale_ub=None, 2025-05-07T20:32:04.7935131Z contiguous=False, 2025-05-07T20:32:04.7935489Z compiled=False, 2025-05-07T20:32:04.7935694Z ) 2025-05-07T20:32:04.7936009Z self = 2025-05-07T20:32:04.7936489Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.7936758Z 2025-05-07T20:32:04.7936836Z @given( 2025-05-07T20:32:04.7937062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7937368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7937665Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7937989Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7938310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7938704Z ) 2025-05-07T20:32:04.7939049Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7939490Z def test_silu_mul_quant( 2025-05-07T20:32:04.7939731Z self, 2025-05-07T20:32:04.7939932Z T: int, 2025-05-07T20:32:04.7940141Z D: int, 2025-05-07T20:32:04.7940354Z scale_ub: Optional[float], 2025-05-07T20:32:04.7940619Z contiguous: bool, 2025-05-07T20:32:04.7940857Z compiled: bool, 2025-05-07T20:32:04.7941072Z ) -> None: 2025-05-07T20:32:04.7941291Z torch.manual_seed(2025) 2025-05-07T20:32:04.7941534Z 2025-05-07T20:32:04.7941797Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7943872Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7945701Z 2025-05-07T20:32:04.7945819Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7946031Z 2025-05-07T20:32:04.7946134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7946540Z self=, 2025-05-07T20:32:04.7946932Z T=4096, 2025-05-07T20:32:04.7947118Z D=7168, 2025-05-07T20:32:04.7947312Z scale_ub=None, 2025-05-07T20:32:04.7947516Z contiguous=True, 2025-05-07T20:32:04.7947736Z compiled=True, 2025-05-07T20:32:04.7947934Z ) 2025-05-07T20:32:04.7948250Z self = 2025-05-07T20:32:04.7948732Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.7949005Z 2025-05-07T20:32:04.7949151Z @given( 2025-05-07T20:32:04.7949387Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7949696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7950014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7950338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7950660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7950945Z ) 2025-05-07T20:32:04.7951290Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7951728Z def test_silu_mul_quant( 2025-05-07T20:32:04.7951966Z self, 2025-05-07T20:32:04.7952175Z T: int, 2025-05-07T20:32:04.7952369Z D: int, 2025-05-07T20:32:04.7952585Z scale_ub: Optional[float], 2025-05-07T20:32:04.7952853Z contiguous: bool, 2025-05-07T20:32:04.7953092Z compiled: bool, 2025-05-07T20:32:04.7953314Z ) -> None: 2025-05-07T20:32:04.7953531Z torch.manual_seed(2025) 2025-05-07T20:32:04.7953777Z 2025-05-07T20:32:04.7954044Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7956173Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7958147Z 2025-05-07T20:32:04.7958271Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7958536Z 2025-05-07T20:32:04.7958640Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7959100Z self=, 2025-05-07T20:32:04.7959496Z T=2048, 2025-05-07T20:32:04.7959684Z D=5120, 2025-05-07T20:32:04.7959874Z scale_ub=1200.0, 2025-05-07T20:32:04.7960098Z contiguous=False, 2025-05-07T20:32:04.7960340Z compiled=False, 2025-05-07T20:32:04.7960543Z ) 2025-05-07T20:32:04.7960867Z self = 2025-05-07T20:32:04.7961353Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:04.7961629Z 2025-05-07T20:32:04.7961708Z @given( 2025-05-07T20:32:04.7961932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.7962268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.7962602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.7962925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.7963253Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.7963532Z ) 2025-05-07T20:32:04.7963878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.7964315Z def test_silu_mul_quant( 2025-05-07T20:32:04.7964552Z self, 2025-05-07T20:32:04.7964753Z T: int, 2025-05-07T20:32:04.7964947Z D: int, 2025-05-07T20:32:04.7965164Z scale_ub: Optional[float], 2025-05-07T20:32:04.7965432Z contiguous: bool, 2025-05-07T20:32:04.7965680Z compiled: bool, 2025-05-07T20:32:04.7965902Z ) -> None: 2025-05-07T20:32:04.7966116Z torch.manual_seed(2025) 2025-05-07T20:32:04.7966358Z 2025-05-07T20:32:04.7966640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.7968847Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.7970686Z 2025-05-07T20:32:04.7970811Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.7971020Z 2025-05-07T20:32:04.7971120Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.7971535Z self=, 2025-05-07T20:32:04.7971929Z T=4096, 2025-05-07T20:32:04.7972124Z D=7168, 2025-05-07T20:32:04.7972339Z scale_ub=1200.0, 2025-05-07T20:32:04.7978552Z contiguous=True, 2025-05-07T20:32:04.7978816Z compiled=False, 2025-05-07T20:32:04.7979019Z ) 2025-05-07T20:32:04.8892900Z self = 2025-05-07T20:32:04.8893740Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.8894125Z 2025-05-07T20:32:04.8894236Z @given( 2025-05-07T20:32:04.8894731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8895054Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8895354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8895684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8896019Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8896294Z ) 2025-05-07T20:32:04.8896638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8897080Z def test_silu_mul_quant( 2025-05-07T20:32:04.8897316Z self, 2025-05-07T20:32:04.8897509Z T: int, 2025-05-07T20:32:04.8897707Z D: int, 2025-05-07T20:32:04.8897919Z scale_ub: Optional[float], 2025-05-07T20:32:04.8898303Z contiguous: bool, 2025-05-07T20:32:04.8898539Z compiled: bool, 2025-05-07T20:32:04.8898757Z ) -> None: 2025-05-07T20:32:04.8898974Z torch.manual_seed(2025) 2025-05-07T20:32:04.8899212Z 2025-05-07T20:32:04.8899490Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8901521Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8903787Z 2025-05-07T20:32:04.8903944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8904182Z 2025-05-07T20:32:04.8904286Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8904696Z self=, 2025-05-07T20:32:04.8905090Z T=16384, 2025-05-07T20:32:04.8905291Z D=7168, 2025-05-07T20:32:04.8905488Z scale_ub=None, 2025-05-07T20:32:04.8905698Z contiguous=False, 2025-05-07T20:32:04.8905926Z compiled=True, 2025-05-07T20:32:04.8906128Z ) 2025-05-07T20:32:04.8906440Z self = 2025-05-07T20:32:04.8906926Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:04.8907208Z 2025-05-07T20:32:04.8907289Z @given( 2025-05-07T20:32:04.8907515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8907822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8908127Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8908458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8908780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8909129Z ) 2025-05-07T20:32:04.8909479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8909926Z def test_silu_mul_quant( 2025-05-07T20:32:04.8910159Z self, 2025-05-07T20:32:04.8910355Z T: int, 2025-05-07T20:32:04.8910554Z D: int, 2025-05-07T20:32:04.8910769Z scale_ub: Optional[float], 2025-05-07T20:32:04.8911037Z contiguous: bool, 2025-05-07T20:32:04.8911274Z compiled: bool, 2025-05-07T20:32:04.8911491Z ) -> None: 2025-05-07T20:32:04.8911705Z torch.manual_seed(2025) 2025-05-07T20:32:04.8911948Z 2025-05-07T20:32:04.8912214Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8914321Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8916167Z 2025-05-07T20:32:04.8916290Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8916500Z 2025-05-07T20:32:04.8916606Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8917010Z self=, 2025-05-07T20:32:04.8917403Z T=4096, 2025-05-07T20:32:04.8917591Z D=7168, 2025-05-07T20:32:04.8917786Z scale_ub=None, 2025-05-07T20:32:04.8918002Z contiguous=True, 2025-05-07T20:32:04.8918272Z compiled=False, 2025-05-07T20:32:04.8918512Z ) 2025-05-07T20:32:04.8918826Z self = 2025-05-07T20:32:04.8919317Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8919583Z 2025-05-07T20:32:04.8919668Z @given( 2025-05-07T20:32:04.8919901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8920211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8920519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8920847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8921164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8921447Z ) 2025-05-07T20:32:04.8921825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8922366Z def test_silu_mul_quant( 2025-05-07T20:32:04.8922662Z self, 2025-05-07T20:32:04.8922904Z T: int, 2025-05-07T20:32:04.8923153Z D: int, 2025-05-07T20:32:04.8923437Z scale_ub: Optional[float], 2025-05-07T20:32:04.8923772Z contiguous: bool, 2025-05-07T20:32:04.8924066Z compiled: bool, 2025-05-07T20:32:04.8924295Z ) -> None: 2025-05-07T20:32:04.8924513Z torch.manual_seed(2025) 2025-05-07T20:32:04.8924761Z 2025-05-07T20:32:04.8925028Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8927049Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8929146Z 2025-05-07T20:32:04.8929273Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8929485Z 2025-05-07T20:32:04.8929597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8930007Z self=, 2025-05-07T20:32:04.8930409Z T=16384, 2025-05-07T20:32:04.8930608Z D=7168, 2025-05-07T20:32:04.8930807Z scale_ub=None, 2025-05-07T20:32:04.8931014Z contiguous=True, 2025-05-07T20:32:04.8931239Z compiled=False, 2025-05-07T20:32:04.8931443Z ) 2025-05-07T20:32:04.8931758Z self = 2025-05-07T20:32:04.8932375Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:04.8932716Z 2025-05-07T20:32:04.8932819Z @given( 2025-05-07T20:32:04.8933105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8933496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8933872Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8934194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8934517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8934800Z ) 2025-05-07T20:32:04.8935293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8935732Z def test_silu_mul_quant( 2025-05-07T20:32:04.8935972Z self, 2025-05-07T20:32:04.8936170Z T: int, 2025-05-07T20:32:04.8936359Z D: int, 2025-05-07T20:32:04.8936580Z scale_ub: Optional[float], 2025-05-07T20:32:04.8936848Z contiguous: bool, 2025-05-07T20:32:04.8937093Z compiled: bool, 2025-05-07T20:32:04.8937316Z ) -> None: 2025-05-07T20:32:04.8937536Z torch.manual_seed(2025) 2025-05-07T20:32:04.8937774Z 2025-05-07T20:32:04.8938038Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8940132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:04.8942027Z 2025-05-07T20:32:04.8942146Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:04.8942360Z 2025-05-07T20:32:04.8942463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.8942867Z self=, 2025-05-07T20:32:04.8943260Z T=16384, 2025-05-07T20:32:04.8943454Z D=7168, 2025-05-07T20:32:04.8943645Z scale_ub=1200.0, 2025-05-07T20:32:04.8943866Z contiguous=True, 2025-05-07T20:32:04.8944090Z compiled=False, 2025-05-07T20:32:04.8944295Z ) 2025-05-07T20:32:04.8944611Z self = 2025-05-07T20:32:04.8945101Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.8945382Z 2025-05-07T20:32:04.8945464Z @given( 2025-05-07T20:32:04.8945693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.8946003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.8946302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.8946628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.8946955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.8947234Z ) 2025-05-07T20:32:04.8947576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.8948008Z def test_silu_mul_quant( 2025-05-07T20:32:04.8948250Z self, 2025-05-07T20:32:04.8948446Z T: int, 2025-05-07T20:32:04.8948640Z D: int, 2025-05-07T20:32:04.8948856Z scale_ub: Optional[float], 2025-05-07T20:32:04.8949171Z contiguous: bool, 2025-05-07T20:32:04.8949406Z compiled: bool, 2025-05-07T20:32:04.8949625Z ) -> None: 2025-05-07T20:32:04.8949841Z torch.manual_seed(2025) 2025-05-07T20:32:04.8950078Z 2025-05-07T20:32:04.8950343Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.8952370Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
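Editor's note: the error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, under the assumption that it runs in a fresh process (the CUDA caching allocator reads the variable before the first allocation); the shape mirrors the first failing example above:

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    # Mirror the first falsifying allocation: T=4096, D=5120, bf16 -> 80 MiB.
    x = torch.randn([4096, 2 * 5120], device="cuda", dtype=torch.bfloat16)
    print(f"{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated")

Note this only mitigates fragmentation; it cannot help once nearly all of the 22.07 GiB is genuinely in use, as the later examples show.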
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

This example allocated successfully and reached the kernel launch:

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
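Editor's note: the fp8e4nv failure is an architecture gap rather than a logic bug in the test. This job runs on linux.g5.4xlarge.nvidia.gpu (NVIDIA A10G, compute capability 8.6), while Triton's fp8e4nv type requires compute capability 8.9 or newer, which is consistent with the "supported fp8 dtypes are ('fp8e4b15', 'fp8e5')" message. A hedged sketch of a guard such a test could use to skip cleanly on these runners; the class name and skip message are illustrative, not from the log:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv needs SM 8.9+ (Ada/Hopper); the A10G reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8KernelGuardTests(unittest.TestCase):
        def test_capability(self) -> None:
            # Only runs on hardware where the Triton fp8e4nv path can compile.
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))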
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

This compiled example reached the same kernel launch through torch._dynamo
(/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn)
and then through activation.py:80 (silu_mul_quant -> _fbgemm_silu_mul_quant[grid]) into the identical failure:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining small examples failed with OutOfMemoryError before reaching the kernel, with only 4.44 MiB of the 22.07 GiB capacity still free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:92)
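Editor's note: free memory shrinks across examples (26.44 MiB free early in the run, 4.44 MiB by this point), which suggests allocations accumulating across Hypothesis examples. A sketch of an explicit per-example cleanup; whether the harness actually holds stale references is an assumption here, not something the log proves:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop tensors only reachable from dead frames
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure pending kernels are not pinning memory

    # e.g. call between Hypothesis examples, or from unittest tearDown():
    release_cuda_memory()
    print(f"{torch.cuda.memory_reserved() / 2**20:.1f} MiB still reserved")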
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
+ Exception Group Traceback (most recent call last):
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
  |   yield
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run
  |   self._callTestMethod(testMethod)
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
  |   if method() is not None:
  | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |   T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |   raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (activation_test.py:92, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16))
    | Falsifying example: test_silu_mul_quant(
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (activation_test.py:92)
    | Falsifying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Raised from activation_test.py:126 (y_fp8_ref, y_scale_ref = ref_fn()) via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370 (triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]) and the Triton autotuner (triton/runtime/autotuner.py:186 run -> :166 _bench -> triton/testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> compiler.py:100 make_ir).
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

For this example fn() succeeded; the failure came from the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(then via triton/runtime/autotuner.py and jit.py into compiler.py:273 compile -> make_ir, as in failure 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()  (moe/activation_test.py:117)

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
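Editor's note: Hypothesis spells out the replay mechanism above. A sketch of how that decorator would sit on this test for local debugging; @reproduce_failure is real Hypothesis API and the blob is the one printed for failure 1, but the body here is a placeholder standing in for the real test_silu_mul_quant body:

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Temporary, for local debugging only: replays the exact choice sequence
    # behind falsifying example 1. Remove the decorator once the bug is fixed.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant_replay(T, D, scale_ub, contiguous, compiled):
        ...  # body of test_silu_mul_quant from activation_test.py goes here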
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4722315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4723480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4724403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4725155Z kernel = self.compile( 2025-05-07T20:32:05.4725905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4726822Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4727376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4727709Z 2025-05-07T20:32:05.4727996Z self = 2025-05-07T20:32:05.4729947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4731865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49872f9da0>} 2025-05-07T20:32:05.4733794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4735195Z context = 2025-05-07T20:32:05.4735590Z 2025-05-07T20:32:05.4735813Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4736552Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4737179Z module_map=module_map) 2025-05-07T20:32:05.4737692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4738183Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4738546Z E ^ 2025-05-07T20:32:05.4739174Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4739810Z 2025-05-07T20:32:05.4740386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4741110Z 2025-05-07T20:32:05.4741259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4741832Z self=, 2025-05-07T20:32:05.4742391Z T=2048, 2025-05-07T20:32:05.4742703Z D=5120, 2025-05-07T20:32:05.4742978Z scale_ub=1200.0, 2025-05-07T20:32:05.4743294Z contiguous=True, 2025-05-07T20:32:05.4743618Z compiled=True, 2025-05-07T20:32:05.4743919Z ) 2025-05-07T20:32:05.4744372Z self = 2025-05-07T20:32:05.4745077Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.4745460Z 2025-05-07T20:32:05.4745580Z @given( 2025-05-07T20:32:05.4745894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4746333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4746756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4747193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4747624Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4748015Z ) 2025-05-07T20:32:05.4748466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4749033Z def test_silu_mul_quant( 2025-05-07T20:32:05.4749475Z self, 2025-05-07T20:32:05.4749760Z T: int, 2025-05-07T20:32:05.4750035Z D: int, 2025-05-07T20:32:05.4750343Z scale_ub: Optional[float], 2025-05-07T20:32:05.4750717Z contiguous: bool, 2025-05-07T20:32:05.4751044Z compiled: bool, 2025-05-07T20:32:05.4751562Z ) -> None: 2025-05-07T20:32:05.4751870Z torch.manual_seed(2025) 2025-05-07T20:32:05.4752203Z 2025-05-07T20:32:05.4752627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4753100Z 2025-05-07T20:32:05.4753360Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4753755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4754176Z x = x_sign * x_clamp 2025-05-07T20:32:05.4754504Z x0 = x[:, :D] 2025-05-07T20:32:05.4754815Z x1 = x[:, D:] 2025-05-07T20:32:05.4755107Z 2025-05-07T20:32:05.4755361Z if contiguous: 2025-05-07T20:32:05.4755690Z x0 = x0.contiguous() 2025-05-07T20:32:05.4756222Z x1 = x1.contiguous() 2025-05-07T20:32:05.4756568Z 2025-05-07T20:32:05.4756836Z if scale_ub is not None: 2025-05-07T20:32:05.4757224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4757705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4758128Z ) 2025-05-07T20:32:05.4758386Z else: 2025-05-07T20:32:05.4758682Z scale_ub_tensor = None 2025-05-07T20:32:05.4758936Z 2025-05-07T20:32:05.4759172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4759487Z op = silu_mul_quant 2025-05-07T20:32:05.4759733Z if compiled: 2025-05-07T20:32:05.4759984Z op = torch.compile(op) 2025-05-07T20:32:05.4760278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4760547Z 2025-05-07T20:32:05.4760748Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4761034Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4761325Z 2025-05-07T20:32:05.4761566Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4761900Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4762200Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4762517Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4762877Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4763191Z 2025-05-07T20:32:05.4763386Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:05.4763587Z 2025-05-07T20:32:05.4763688Z moe/activation_test.py:126: 2025-05-07T20:32:05.4763982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4764326Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4764644Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4765436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4766203Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4766747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4767436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4768133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4768858Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4769620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4770369Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4771107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4771768Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4772357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4773578Z fn() 2025-05-07T20:32:05.4774091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4774666Z self.fn.run( 2025-05-07T20:32:05.4775123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4775656Z kernel = self.compile( 2025-05-07T20:32:05.4776190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4776840Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4777236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4777552Z 2025-05-07T20:32:05.4777755Z self = 2025-05-07T20:32:05.4778848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4780231Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4985c122a0>} 2025-05-07T20:32:05.4781579Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4782659Z context = 2025-05-07T20:32:05.4782955Z 2025-05-07T20:32:05.4783122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4783646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4784107Z module_map=module_map) 2025-05-07T20:32:05.4784474Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4784826Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.4785086Z E ^ 2025-05-07T20:32:05.4785550Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4786005Z 2025-05-07T20:32:05.4786423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4786936Z 2025-05-07T20:32:05.4787043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4787453Z self=, 2025-05-07T20:32:05.4787862Z T=16384, 2025-05-07T20:32:05.4788057Z D=7168, 2025-05-07T20:32:05.4788248Z scale_ub=1200.0, 2025-05-07T20:32:05.4788470Z contiguous=False, 2025-05-07T20:32:05.4788699Z compiled=False, 2025-05-07T20:32:05.4788897Z ) 2025-05-07T20:32:05.4789331Z self = 2025-05-07T20:32:05.4789835Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.4790112Z 2025-05-07T20:32:05.4790196Z @given( 2025-05-07T20:32:05.4790423Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4790743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4791052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4791371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4791696Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4791977Z ) 2025-05-07T20:32:05.4792324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4792768Z def test_silu_mul_quant( 2025-05-07T20:32:05.4793007Z self, 2025-05-07T20:32:05.4793202Z T: int, 2025-05-07T20:32:05.4793397Z D: int, 2025-05-07T20:32:05.4793707Z scale_ub: Optional[float], 2025-05-07T20:32:05.4793980Z contiguous: bool, 2025-05-07T20:32:05.4794216Z compiled: bool, 2025-05-07T20:32:05.4794437Z ) -> None: 2025-05-07T20:32:05.4794653Z torch.manual_seed(2025) 2025-05-07T20:32:05.4794888Z 2025-05-07T20:32:05.4795162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4795499Z 2025-05-07T20:32:05.4795688Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4795978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4796287Z x = x_sign * x_clamp 2025-05-07T20:32:05.4796537Z x0 = x[:, :D] 2025-05-07T20:32:05.4796749Z x1 = x[:, D:] 2025-05-07T20:32:05.4797004Z 2025-05-07T20:32:05.4797265Z if contiguous: 2025-05-07T20:32:05.4797493Z x0 = x0.contiguous() 2025-05-07T20:32:05.4797757Z x1 = x1.contiguous() 2025-05-07T20:32:05.4797999Z 2025-05-07T20:32:05.4798185Z if scale_ub is not None: 2025-05-07T20:32:05.4798471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4798807Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4799115Z ) 2025-05-07T20:32:05.4799306Z else: 2025-05-07T20:32:05.4799521Z scale_ub_tensor = None 2025-05-07T20:32:05.4799772Z 2025-05-07T20:32:05.4799997Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4800312Z op = silu_mul_quant 2025-05-07T20:32:05.4800562Z if compiled: 
2025-05-07T20:32:05.4800808Z op = torch.compile(op) 2025-05-07T20:32:05.4801110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4801385Z 2025-05-07T20:32:05.4801572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4801746Z 2025-05-07T20:32:05.4801844Z moe/activation_test.py:117: 2025-05-07T20:32:05.4802137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4802459Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4802745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4803432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4804122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4804646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4805331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4805990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4806525Z kernel = self.compile( 2025-05-07T20:32:05.4807067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4807719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4808119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4808351Z 2025-05-07T20:32:05.4808556Z self = 2025-05-07T20:32:05.4809631Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4810993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f49857a2700>} 2025-05-07T20:32:05.4812358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4813499Z context = 2025-05-07T20:32:05.4813787Z 2025-05-07T20:32:05.4813952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4814470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4814936Z module_map=module_map) 2025-05-07T20:32:05.4815296Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4815647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4815907Z E ^ 2025-05-07T20:32:05.4816373Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4816862Z 2025-05-07T20:32:05.4817318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4817838Z 2025-05-07T20:32:05.4817942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4818358Z self=, 2025-05-07T20:32:05.4818762Z T=1, 2025-05-07T20:32:05.4818947Z D=7168, 2025-05-07T20:32:05.4819147Z scale_ub=None, 2025-05-07T20:32:05.4819368Z contiguous=True, 2025-05-07T20:32:05.4819589Z compiled=True, 2025-05-07T20:32:05.4819797Z ) 2025-05-07T20:32:05.4820118Z self = 2025-05-07T20:32:05.4820592Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.4820856Z 2025-05-07T20:32:05.4820935Z @given( 2025-05-07T20:32:05.4821175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4821483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4821796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4822126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4822476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4822788Z ) 2025-05-07T20:32:05.4823146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4823592Z def test_silu_mul_quant( 2025-05-07T20:32:05.4823828Z self, 2025-05-07T20:32:05.4824023Z T: int, 2025-05-07T20:32:05.4824222Z D: int, 2025-05-07T20:32:05.4824437Z scale_ub: Optional[float], 2025-05-07T20:32:05.4824711Z contiguous: bool, 2025-05-07T20:32:05.4824949Z compiled: bool, 2025-05-07T20:32:05.4825165Z ) -> None: 2025-05-07T20:32:05.4825384Z torch.manual_seed(2025) 2025-05-07T20:32:05.4825634Z 2025-05-07T20:32:05.4825899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4826242Z 2025-05-07T20:32:05.4826443Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4826728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4827038Z x = x_sign * x_clamp 2025-05-07T20:32:05.4827279Z x0 = x[:, :D] 2025-05-07T20:32:05.4827501Z x1 = x[:, D:] 2025-05-07T20:32:05.4827701Z 2025-05-07T20:32:05.4827889Z if contiguous: 2025-05-07T20:32:05.4828121Z x0 = x0.contiguous() 2025-05-07T20:32:05.4828682Z x1 = x1.contiguous() 2025-05-07T20:32:05.4828920Z 2025-05-07T20:32:05.4829169Z if scale_ub is not None: 2025-05-07T20:32:05.4829439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4829773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4830084Z ) 2025-05-07T20:32:05.4830272Z else: 2025-05-07T20:32:05.4830480Z scale_ub_tensor = None 2025-05-07T20:32:05.4830737Z 2025-05-07T20:32:05.4830966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4831289Z op = silu_mul_quant 2025-05-07T20:32:05.4831541Z if compiled: 2025-05-07T20:32:05.4831786Z op = torch.compile(op) 2025-05-07T20:32:05.4832083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4832527Z 2025-05-07T20:32:05.4832721Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4833012Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4833304Z 2025-05-07T20:32:05.4833543Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4833874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4834166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4834479Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4834833Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4835155Z 2025-05-07T20:32:05.4835362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.4835712Z 2025-05-07T20:32:05.4835811Z moe/activation_test.py:126: 2025-05-07T20:32:05.4836109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4836447Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4836777Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4837555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4838302Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4838850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4839518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4840202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4840922Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4841679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4842420Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4843144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4843777Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4844372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4844879Z fn() 2025-05-07T20:32:05.4845383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4845963Z self.fn.run( 2025-05-07T20:32:05.4846427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4846956Z kernel = self.compile( 2025-05-07T20:32:05.4847494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4848149Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4848539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4848773Z 2025-05-07T20:32:05.4848981Z self = 2025-05-07T20:32:05.4850052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4851413Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f4985a18ae0>} 2025-05-07T20:32:05.4852829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4853850Z context = 2025-05-07T20:32:05.4854141Z 2025-05-07T20:32:05.4854308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4854824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4855285Z module_map=module_map) 2025-05-07T20:32:05.4855650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4856004Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.4856274Z E ^ 2025-05-07T20:32:05.4856734Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4857262Z 2025-05-07T20:32:05.4857674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4858181Z 2025-05-07T20:32:05.4858298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4858712Z self=, 2025-05-07T20:32:05.4859108Z T=4096, 2025-05-07T20:32:05.4859302Z D=5120, 2025-05-07T20:32:05.4859498Z scale_ub=None, 2025-05-07T20:32:05.4859710Z contiguous=False, 2025-05-07T20:32:05.4859945Z compiled=False, 2025-05-07T20:32:05.4860149Z ) 2025-05-07T20:32:05.4860462Z self = 2025-05-07T20:32:05.4860965Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.4861235Z 2025-05-07T20:32:05.4861322Z @given( 2025-05-07T20:32:05.4861554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4861870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4862173Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4862505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4862830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4863112Z ) 2025-05-07T20:32:05.4863456Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4863887Z def test_silu_mul_quant( 2025-05-07T20:32:05.4864128Z self, 2025-05-07T20:32:05.4864322Z T: int, 2025-05-07T20:32:05.4864512Z D: int, 2025-05-07T20:32:05.4864734Z scale_ub: Optional[float], 2025-05-07T20:32:05.4865003Z contiguous: bool, 2025-05-07T20:32:05.4865236Z compiled: bool, 2025-05-07T20:32:05.4865459Z ) -> None: 2025-05-07T20:32:05.4865675Z torch.manual_seed(2025) 2025-05-07T20:32:05.4865914Z 2025-05-07T20:32:05.4866194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4866536Z 2025-05-07T20:32:05.4866725Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4867014Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4867323Z x = x_sign * x_clamp 2025-05-07T20:32:05.4867560Z x0 = x[:, :D] 2025-05-07T20:32:05.4867770Z x1 = x[:, D:] 2025-05-07T20:32:05.4867977Z 2025-05-07T20:32:05.4868170Z if contiguous: 2025-05-07T20:32:05.4868398Z x0 = x0.contiguous() 2025-05-07T20:32:05.4868654Z x1 = x1.contiguous() 2025-05-07T20:32:05.4868891Z 2025-05-07T20:32:05.4869123Z if scale_ub is not None: 2025-05-07T20:32:05.4869396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4869727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4878013Z ) 2025-05-07T20:32:05.4878227Z else: 2025-05-07T20:32:05.4878454Z scale_ub_tensor = None 2025-05-07T20:32:05.4878719Z 2025-05-07T20:32:05.4878959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4879274Z op = silu_mul_quant 2025-05-07T20:32:05.4879529Z if compiled: 
2025-05-07T20:32:05.4879900Z op = torch.compile(op) 2025-05-07T20:32:05.4880195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4880471Z 2025-05-07T20:32:05.4880670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4880834Z 2025-05-07T20:32:05.4880936Z moe/activation_test.py:117: 2025-05-07T20:32:05.4881074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4881177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4881278Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4881789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4881962Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4882401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4882647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4882993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4883096Z kernel = self.compile( 2025-05-07T20:32:05.4883481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4883663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4883793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4883798Z 2025-05-07T20:32:05.4884002Z self = 2025-05-07T20:32:05.4884790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4885295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4985c11940>} 2025-05-07T20:32:05.4886047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4886239Z context = 2025-05-07T20:32:05.4886243Z 2025-05-07T20:32:05.4886407Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4886673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4886786Z module_map=module_map) 2025-05-07T20:32:05.4886957Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4887057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4887138Z E ^ 2025-05-07T20:32:05.4887508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4887513Z 2025-05-07T20:32:05.4887930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4887934Z 2025-05-07T20:32:05.4888046Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4888269Z self=, 2025-05-07T20:32:05.4888349Z T=4096, 2025-05-07T20:32:05.4888435Z D=7168, 2025-05-07T20:32:05.4888518Z scale_ub=None, 2025-05-07T20:32:05.4888607Z contiguous=False, 2025-05-07T20:32:05.4888700Z compiled=False, 2025-05-07T20:32:05.4888778Z ) 2025-05-07T20:32:05.4888997Z self = 2025-05-07T20:32:05.4889177Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.4889181Z 2025-05-07T20:32:05.4889345Z @given( 2025-05-07T20:32:05.4889477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4889578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4889694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4889820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4889936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4890012Z ) 2025-05-07T20:32:05.4890263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4890357Z def test_silu_mul_quant( 2025-05-07T20:32:05.4890436Z self, 2025-05-07T20:32:05.4890524Z T: int, 2025-05-07T20:32:05.4890645Z D: int, 2025-05-07T20:32:05.4890785Z scale_ub: Optional[float], 2025-05-07T20:32:05.4890884Z contiguous: bool, 2025-05-07T20:32:05.4890970Z compiled: bool, 2025-05-07T20:32:05.4891058Z ) -> None: 2025-05-07T20:32:05.4891154Z torch.manual_seed(2025) 2025-05-07T20:32:05.4891235Z 2025-05-07T20:32:05.4891411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4891487Z 2025-05-07T20:32:05.4891580Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4891717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4891809Z x = x_sign * x_clamp 2025-05-07T20:32:05.4891891Z x0 = x[:, :D] 2025-05-07T20:32:05.4891980Z x1 = x[:, D:] 2025-05-07T20:32:05.4892056Z 2025-05-07T20:32:05.4892142Z if contiguous: 2025-05-07T20:32:05.4892243Z x0 = x0.contiguous() 2025-05-07T20:32:05.4892334Z x1 = x1.contiguous() 2025-05-07T20:32:05.4892420Z 2025-05-07T20:32:05.4892515Z if scale_ub is not None: 2025-05-07T20:32:05.4892622Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4892765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4892842Z ) 2025-05-07T20:32:05.4892926Z else: 2025-05-07T20:32:05.4893029Z scale_ub_tensor = None 2025-05-07T20:32:05.4893105Z 2025-05-07T20:32:05.4893237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4893337Z op = silu_mul_quant 2025-05-07T20:32:05.4893426Z if compiled: 2025-05-07T20:32:05.4893527Z op = torch.compile(op) 2025-05-07T20:32:05.4893642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4893716Z 2025-05-07T20:32:05.4893816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.4893820Z 2025-05-07T20:32:05.4893921Z moe/activation_test.py:117: 2025-05-07T20:32:05.4894053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4894172Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.4894273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4894772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.4894880Z 
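The root cause is hardware, not the test logic: this job runs on a g5.4xlarge, whose NVIDIA A10G reports CUDA compute capability (8, 6), while Triton's fp8e4nv type (FP8 E4M3) needs capability (8, 9) or newer; on SM 8.6 Triton only offers fp8e4b15 and fp8e5, exactly as the ValueError says. A minimal sketch of a capability gate the test module could carry -- supports_fp8e4nv and the decorator placement are illustrative assumptions, not helpers that exist in the suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (FP8 E4M3) only on SM 8.9+ GPUs (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
class ActivationTests(unittest.TestCase):  # hypothetical class name
    ...

With a gate like this, the examples below would be reported as skipped instead of erroring one by one.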
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
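The tracebacks show the error is raised while Triton builds the kernel IR (src.make_ir -> ast_to_ttir), i.e. at compile time, before any launch -- which is why even the autotuner's warmup benchmark trips it. The failure can be reproduced outside the suite in a few lines; this is a sketch assuming any CUDA machine with SM < 8.9, and the kernel name is made up:

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    # On SM < 8.9 this cast is what make_ir rejects with
    # "type fp8e4nv not supported in this architecture".
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


x = torch.randn(1024, device="cuda", dtype=torch.float32)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
# Raises triton.compiler.errors.CompilationError on an A10G (SM 8.6).
_cast_fp8e4nv_kernel[(1,)](x, y, x.numel(), BLOCK=1024)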
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.4938340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4938564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4938907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4939006Z kernel = self.compile( 2025-05-07T20:32:05.4939393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4939566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4939694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4939739Z 2025-05-07T20:32:05.4939952Z self = 2025-05-07T20:32:05.4940767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4941273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f498548ea20>} 2025-05-07T20:32:05.4942015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.4942204Z context = 2025-05-07T20:32:05.4942215Z 2025-05-07T20:32:05.4942379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.4942643Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4942759Z module_map=module_map) 2025-05-07T20:32:05.4942920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4943024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4943112Z E ^ 2025-05-07T20:32:05.4943464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4943469Z 2025-05-07T20:32:05.4943887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.4943891Z 2025-05-07T20:32:05.4943994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.4944217Z self=, 2025-05-07T20:32:05.4944299Z T=1, 2025-05-07T20:32:05.4944383Z D=5120, 2025-05-07T20:32:05.4944468Z scale_ub=None, 2025-05-07T20:32:05.4944566Z contiguous=True, 2025-05-07T20:32:05.4944650Z compiled=True, 2025-05-07T20:32:05.4944724Z ) 2025-05-07T20:32:05.4944950Z self = 2025-05-07T20:32:05.4945115Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.4945120Z 2025-05-07T20:32:05.4945202Z @given( 2025-05-07T20:32:05.4945321Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.4945419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.4945541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.4945658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.4945772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.4945852Z ) 2025-05-07T20:32:05.4946096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.4946192Z def test_silu_mul_quant( 2025-05-07T20:32:05.4946278Z self, 2025-05-07T20:32:05.4946356Z T: int, 2025-05-07T20:32:05.4946437Z D: int, 2025-05-07T20:32:05.4946534Z scale_ub: Optional[float], 2025-05-07T20:32:05.4946623Z contiguous: bool, 2025-05-07T20:32:05.4946820Z compiled: bool, 2025-05-07T20:32:05.4946901Z ) -> None: 2025-05-07T20:32:05.4946998Z torch.manual_seed(2025) 2025-05-07T20:32:05.4947075Z 2025-05-07T20:32:05.4947243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.4947316Z 2025-05-07T20:32:05.4947417Z x_sign = torch.sign(x) 2025-05-07T20:32:05.4947542Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.4947631Z x = x_sign * x_clamp 2025-05-07T20:32:05.4947717Z x0 = x[:, :D] 2025-05-07T20:32:05.4947798Z x1 = x[:, D:] 2025-05-07T20:32:05.4947876Z 2025-05-07T20:32:05.4947960Z if contiguous: 2025-05-07T20:32:05.4948097Z x0 = x0.contiguous() 2025-05-07T20:32:05.4948236Z x1 = x1.contiguous() 2025-05-07T20:32:05.4948311Z 2025-05-07T20:32:05.4948401Z if scale_ub is not None: 2025-05-07T20:32:05.4948512Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.4948651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.4948726Z ) 2025-05-07T20:32:05.4948809Z else: 2025-05-07T20:32:05.4948902Z scale_ub_tensor = None 2025-05-07T20:32:05.4948974Z 2025-05-07T20:32:05.4949171Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4949262Z op = silu_mul_quant 2025-05-07T20:32:05.4949346Z if compiled: 2025-05-07T20:32:05.4949452Z op = torch.compile(op) 2025-05-07T20:32:05.4949556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.4949634Z 2025-05-07T20:32:05.4949723Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.4949845Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.4949926Z 2025-05-07T20:32:05.4950061Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.4950161Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.4950267Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.4950392Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.4950531Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4950610Z 2025-05-07T20:32:05.4950708Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.4950712Z 2025-05-07T20:32:05.4950815Z moe/activation_test.py:126: 2025-05-07T20:32:05.4950944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4951049Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.4951189Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.4951743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.4951852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.4952214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.4952438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.4952808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.4953059Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4953457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:05.4953713Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.4954085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.4954264Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.4954600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.4954765Z fn() 2025-05-07T20:32:05.4955171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.4955255Z self.fn.run( 2025-05-07T20:32:05.4955590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.4955691Z kernel = self.compile( 2025-05-07T20:32:05.4956069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.4956250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4956417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.4956460Z 2025-05-07T20:32:05.4956663Z self = 2025-05-07T20:32:05.4957451Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.4957947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
[Hypothesis retries the same test source for each generated example; every attempt below fails with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The repeated blocks are condensed to their parameters and failure site.]

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails at `y_fp8_ref, y_scale_ref = ref_fn()` (moe/activation_test.py:126); CompilationError compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
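Note: for context on the failing reference path, triton_quantize_fp8_row quantizes each row of y to fp8 with one scale per row, optionally bounded by scale_ub; the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal pure-PyTorch sketch of the assumed semantics follows (the helper name and the exact clamping details are assumptions, not FBGEMM's kernel).

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One dequantization scale per row: y ~= y_fp8.float() * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed upper bound
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale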
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fails earlier, at `y_fp8, y_scale = fn()` (moe/activation_test.py:117): the torch.compile'd silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, `_fbgemm_silu_mul_quant[grid](`) raises the same CompilationError while compiling _fbgemm_silu_mul_quant
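Note: unlike the preceding examples, this one fails inside fn() itself, with the traceback passing through torch/_dynamo/eval_frame.py into the fused _fbgemm_silu_mul_quant kernel. The eager/compiled toggle in the test is the standard torch.compile pattern; here is a self-contained sketch, with a toy op standing in for silu_mul_quant.

    from typing import Callable

    import torch

    def run_maybe_compiled(op: Callable, *args, compiled: bool):
        if compiled:
            # torch.compile returns a wrapped callable; kernel compilation
            # errors surface later, through _dynamo's eval_frame wrapper.
            op = torch.compile(op)
        return op(*args)

    # e.g.: run_maybe_compiled(lambda a, b: a * torch.sigmoid(a) * b, x0, x1, compiled=True)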
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> same CompilationError in _kernel_quantize_fp8_row (via ref_fn)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError in _fbgemm_silu_mul_quant (via fn, eager path)
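Note: the repetition in this log is expected from the test's Hypothesis settings: verbosity=Verbosity.verbose prints a "Trying example:" block for every generated input, and max_examples=_MAX_SAMPLES bounds how many examples are generated. A minimal sketch of that decorator stack (strategy values trimmed, and _MAX_SAMPLES replaced by a literal for illustration):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=8, deadline=None)
    def test_sketch(T: int, compiled: bool) -> None:
        # With Verbosity.verbose, Hypothesis logs each generated (T, compiled)
        # pair as "Trying example: ..." before running the body.
        assert T in (1, 128)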
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5084356Z 2025-05-07T20:32:05.5084775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5084780Z 2025-05-07T20:32:05.5084986Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5085213Z self=, 2025-05-07T20:32:05.5085298Z T=128, 2025-05-07T20:32:05.5085379Z D=5120, 2025-05-07T20:32:05.5085465Z scale_ub=None, 2025-05-07T20:32:05.5085561Z contiguous=False, 2025-05-07T20:32:05.5085645Z compiled=True, 2025-05-07T20:32:05.5085718Z ) 2025-05-07T20:32:05.5085942Z self = 2025-05-07T20:32:05.5086112Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5086116Z 2025-05-07T20:32:05.5086200Z @given( 2025-05-07T20:32:05.5086358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5086542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5086668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5086783Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5086900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5086979Z ) 2025-05-07T20:32:05.5087227Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5087323Z def test_silu_mul_quant( 2025-05-07T20:32:05.5087404Z self, 2025-05-07T20:32:05.5087480Z T: int, 2025-05-07T20:32:05.5087562Z D: int, 2025-05-07T20:32:05.5087660Z scale_ub: Optional[float], 2025-05-07T20:32:05.5087749Z contiguous: bool, 2025-05-07T20:32:05.5087840Z compiled: bool, 2025-05-07T20:32:05.5087919Z ) -> None: 2025-05-07T20:32:05.5088013Z torch.manual_seed(2025) 2025-05-07T20:32:05.5088099Z 2025-05-07T20:32:05.5088268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5088343Z 2025-05-07T20:32:05.5088439Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5088562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5088655Z x = x_sign * x_clamp 2025-05-07T20:32:05.5088742Z x0 = x[:, :D] 2025-05-07T20:32:05.5088823Z x1 = x[:, D:] 2025-05-07T20:32:05.5088894Z 2025-05-07T20:32:05.5088984Z if contiguous: 2025-05-07T20:32:05.5089076Z x0 = x0.contiguous() 2025-05-07T20:32:05.5089170Z x1 = x1.contiguous() 2025-05-07T20:32:05.5089241Z 2025-05-07T20:32:05.5089331Z if scale_ub is not None: 2025-05-07T20:32:05.5089444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5089578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5089653Z ) 2025-05-07T20:32:05.5089737Z else: 2025-05-07T20:32:05.5089832Z scale_ub_tensor = None 2025-05-07T20:32:05.5089908Z 2025-05-07T20:32:05.5090044Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5090133Z op = silu_mul_quant 2025-05-07T20:32:05.5090219Z if compiled: 2025-05-07T20:32:05.5090331Z op = torch.compile(op) 2025-05-07T20:32:05.5090437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5090517Z 2025-05-07T20:32:05.5090607Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5090612Z 2025-05-07T20:32:05.5090709Z moe/activation_test.py:117: 2025-05-07T20:32:05.5090848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5090948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5091051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5091424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5091515Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5092016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5092111Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5092583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5092840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5093176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5093269Z kernel = self.compile( 2025-05-07T20:32:05.5093653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5093824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5093960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5094047Z 2025-05-07T20:32:05.5094254Z self = 2025-05-07T20:32:05.5095031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5095537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48995e1ee0>} 2025-05-07T20:32:05.5096288Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5096482Z context = 2025-05-07T20:32:05.5096487Z 2025-05-07T20:32:05.5096651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5096924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5097034Z module_map=module_map) 2025-05-07T20:32:05.5097198Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5097307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5097384Z E ^ 2025-05-07T20:32:05.5097739Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5097744Z 
2025-05-07T20:32:05.5098163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.5098168Z 
2025-05-07T20:32:05.5098270Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.5098501Z     self=,
2025-05-07T20:32:05.5098581Z     T=128,
2025-05-07T20:32:05.5098661Z     D=7168,
2025-05-07T20:32:05.5098750Z     scale_ub=1200.0,
2025-05-07T20:32:05.5098837Z     contiguous=False,
2025-05-07T20:32:05.5098921Z     compiled=False,
2025-05-07T20:32:05.5099000Z )
2025-05-07T20:32:05.5099220Z self = 
2025-05-07T20:32:05.5099392Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:05.5099396Z 
2025-05-07T20:32:05.5099484Z     @given(
2025-05-07T20:32:05.5099601Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.5099707Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.5099822Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.5099939Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.5100059Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.5100134Z     )
2025-05-07T20:32:05.5100378Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.5100481Z     def test_silu_mul_quant(
2025-05-07T20:32:05.5100559Z         self,
2025-05-07T20:32:05.5100637Z         T: int,
2025-05-07T20:32:05.5100718Z         D: int,
2025-05-07T20:32:05.5100814Z         scale_ub: Optional[float],
2025-05-07T20:32:05.5100997Z         contiguous: bool,
2025-05-07T20:32:05.5101086Z         compiled: bool,
2025-05-07T20:32:05.5101165Z     ) -> None:
2025-05-07T20:32:05.5101267Z         torch.manual_seed(2025)
2025-05-07T20:32:05.5101339Z 
2025-05-07T20:32:05.5101507Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.5101587Z 
2025-05-07T20:32:05.5101678Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.5101801Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.5101895Z         x = x_sign * x_clamp
2025-05-07T20:32:05.5101975Z         x0 = x[:, :D]
2025-05-07T20:32:05.5102055Z         x1 = x[:, D:]
2025-05-07T20:32:05.5102173Z 
2025-05-07T20:32:05.5102256Z         if contiguous:
2025-05-07T20:32:05.5102388Z             x0 = x0.contiguous()
2025-05-07T20:32:05.5102485Z             x1 = x1.contiguous()
2025-05-07T20:32:05.5102560Z 
2025-05-07T20:32:05.5102655Z         if scale_ub is not None:
2025-05-07T20:32:05.5102767Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.5102901Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.5102982Z             )
2025-05-07T20:32:05.5103057Z         else:
2025-05-07T20:32:05.5103151Z             scale_ub_tensor = None
2025-05-07T20:32:05.5103228Z 
2025-05-07T20:32:05.5103357Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5103448Z             op = silu_mul_quant
2025-05-07T20:32:05.5103540Z             if compiled:
2025-05-07T20:32:05.5103639Z                 op = torch.compile(op)
2025-05-07T20:32:05.5103745Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5103827Z 
2025-05-07T20:32:05.5103922Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:05.5103929Z 
2025-05-07T20:32:05.5104040Z moe/activation_test.py:117: 
2025-05-07T20:32:05.5104168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5104268Z moe/activation_test.py:115: in fn
2025-05-07T20:32:05.5104378Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5104875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:05.5104970Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:05.5105330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:05.5105554Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.5105901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.5105996Z     kernel = self.compile(
2025-05-07T20:32:05.5106378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.5106556Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.5106687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5106691Z 
2025-05-07T20:32:05.5106901Z self = 
2025-05-07T20:32:05.5107671Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.5108166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48995e1a80>}
2025-05-07T20:32:05.5108917Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.5109239Z context = 
2025-05-07T20:32:05.5109334Z 
2025-05-07T20:32:05.5109505Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.5109766Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.5109874Z                           module_map=module_map)
2025-05-07T20:32:05.5110039Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.5110136Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:05.5110221Z E       ^
2025-05-07T20:32:05.5110573Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5110578Z 
2025-05-07T20:32:05.5111992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
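Every failure in this section bottoms out in the same Triton check: fp8e4nv is Triton's name for the float8_e4m3fn encoding, and this Triton build's CUDA backend only emits it for sufficiently new GPUs, leaving fp8e4b15 and fp8e5 on older parts, exactly as the ValueError reports. A guard along the following lines would let the suite skip cleanly instead of erroring on such GPUs. This is a sketch: the helper name, the test-class name, and the (8, 9) capability threshold (Ada/Hopper, where fp8e4nv support is generally understood to begin) are assumptions, not FBGEMM or Triton API.

    import unittest

    import torch

    def has_fp8e4nv_support() -> bool:
        # Hypothetical helper: assume Triton's fp8e4nv (float8_e4m3fn)
        # codegen needs compute capability >= (8, 9); the ValueError above
        # is what an older architecture produces instead.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):  # class name assumed
        ...  # test_silu_mul_quant and friends would go here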
Hypothesis then drew further examples; each failed with the same CompilationError (ValueError: type fp8e4nv not supported in this architecture) raised while compiling _fbgemm_silu_mul_quant. For compiled=True draws the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (torch.compile's wrapper) before reaching activation.py:80; the failure itself is identical:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
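For orientation while reading these repeated tracebacks: the semantics under test are spelled out by the test's own ref_fn, shown in the next example below, namely SiLU(x0) * x1 computed in fp32 and then quantized row-wise to fp8. The unquantized half, restated as a self-contained eager sketch of ref_fn's first three lines:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in float32, exactly as ref_fn in test_silu_mul_quant
        # computes it before handing off to triton_quantize_fp8_row.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32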
The next draw failed the same way:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

The draw after that is the one example in this section that behaves differently: fn() returned, and the failure moved one step later, into the reference path, which compiles a Triton kernel of its own:

2025-05-07T20:32:05.5170506Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.5170728Z     self=,
2025-05-07T20:32:05.5170816Z     T=1,
2025-05-07T20:32:05.5170894Z     D=7168,
2025-05-07T20:32:05.5170977Z     scale_ub=None,
2025-05-07T20:32:05.5171077Z     contiguous=False,
2025-05-07T20:32:05.5171162Z     compiled=True,
2025-05-07T20:32:05.5171235Z )
2025-05-07T20:32:05.5171467Z self = 
2025-05-07T20:32:05.5171630Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:05.5171635Z 
    ...
2025-05-07T20:32:05.5175641Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5175732Z             op = silu_mul_quant
2025-05-07T20:32:05.5175818Z             if compiled:
2025-05-07T20:32:05.5175926Z                 op = torch.compile(op)
2025-05-07T20:32:05.5176033Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.5176109Z 
2025-05-07T20:32:05.5176215Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.5176337Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.5176418Z 
2025-05-07T20:32:05.5176553Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.5176661Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.5176769Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.5176894Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.5177034Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.5177117Z 
2025-05-07T20:32:05.5177218Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.5177222Z 
2025-05-07T20:32:05.5177320Z moe/activation_test.py:126: 
2025-05-07T20:32:05.5177461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5177570Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.5177716Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.5178275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.5178377Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.5178749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:05.5178974Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.5179347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.5179602Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.5179996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:05.5180255Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.5180630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.5180796Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.5181252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.5181332Z     fn()
2025-05-07T20:32:05.5181736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.5181820Z     self.fn.run(
2025-05-07T20:32:05.5182156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.5182256Z     kernel = self.compile(
2025-05-07T20:32:05.5182674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.5182897Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.5183070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.5183075Z 
2025-05-07T20:32:05.5183279Z self = 
2025-05-07T20:32:05.5184059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.5184555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4dda0>}
2025-05-07T20:32:05.5185302Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.5185497Z context = 
2025-05-07T20:32:05.5185502Z 
2025-05-07T20:32:05.5185668Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.5185932Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.5186041Z                           module_map=module_map)
2025-05-07T20:32:05.5186208Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.5186312Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.5186391Z E       ^
2025-05-07T20:32:05.5186750Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.5186755Z 
2025-05-07T20:32:05.5187167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
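So the reference path is not an eager fallback: triton_quantize_fp8_row autotunes a Triton kernel (_kernel_quantize_fp8_row) that needs the same fp8e4nv conversion, and it dies identically. A rough eager stand-in is possible with PyTorch's native float8 dtype (torch.float8_e4m3fn, available in recent PyTorch). The sketch below assumes the row-wise scheme is "scale each row so its absmax, optionally capped by scale_ub, maps to the fp8 e4m3 maximum of 448"; that is consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) but it is an assumption, not the kernel's documented contract.

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Assumed row-wise recipe: one scale per row, chosen so the row's
        # absmax lands on FP8_E4M3_MAX; scale_ub, when given, caps the absmax.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)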
at 0x7f4899e4dda0>} 2025-05-07T20:32:05.5185302Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5185497Z context = 2025-05-07T20:32:05.5185502Z 2025-05-07T20:32:05.5185668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5185932Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5186041Z module_map=module_map) 2025-05-07T20:32:05.5186208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5186312Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.5186391Z E ^ 2025-05-07T20:32:05.5186750Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5186755Z 2025-05-07T20:32:05.5187167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5187174Z 2025-05-07T20:32:05.5187284Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5187507Z self=, 2025-05-07T20:32:05.5187586Z T=1, 2025-05-07T20:32:05.5187671Z D=5120, 2025-05-07T20:32:05.5187754Z scale_ub=1200.0, 2025-05-07T20:32:05.5187845Z contiguous=False, 2025-05-07T20:32:05.5187936Z compiled=True, 2025-05-07T20:32:05.5188010Z ) 2025-05-07T20:32:05.5188234Z self = 2025-05-07T20:32:05.5188399Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5188403Z 2025-05-07T20:32:05.5188481Z @given( 2025-05-07T20:32:05.5188604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5188702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5188816Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5188936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5189050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5189185Z ) 2025-05-07T20:32:05.5189429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5189523Z def test_silu_mul_quant( 2025-05-07T20:32:05.5189689Z self, 2025-05-07T20:32:05.5189767Z T: int, 2025-05-07T20:32:05.5189843Z D: int, 2025-05-07T20:32:05.5189946Z scale_ub: Optional[float], 2025-05-07T20:32:05.5190034Z contiguous: bool, 2025-05-07T20:32:05.5190119Z compiled: bool, 2025-05-07T20:32:05.5190202Z ) -> None: 2025-05-07T20:32:05.5190296Z torch.manual_seed(2025) 2025-05-07T20:32:05.5190369Z 2025-05-07T20:32:05.5190540Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5190613Z 2025-05-07T20:32:05.5190703Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5190834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5190966Z x = x_sign * x_clamp 2025-05-07T20:32:05.5191092Z x0 = x[:, :D] 2025-05-07T20:32:05.5191173Z x1 = x[:, D:] 2025-05-07T20:32:05.5191244Z 2025-05-07T20:32:05.5191333Z if contiguous: 2025-05-07T20:32:05.5191424Z x0 = x0.contiguous() 2025-05-07T20:32:05.5191518Z x1 = x1.contiguous() 2025-05-07T20:32:05.5191596Z 2025-05-07T20:32:05.5191687Z if scale_ub is not None: 2025-05-07T20:32:05.5191791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5191934Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5192009Z ) 2025-05-07T20:32:05.5192085Z else: 2025-05-07T20:32:05.5192182Z scale_ub_tensor = None 2025-05-07T20:32:05.5192254Z 2025-05-07T20:32:05.5192389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5192491Z op = silu_mul_quant 2025-05-07T20:32:05.5192588Z if compiled: 
2025-05-07T20:32:05.5192714Z op = torch.compile(op) 2025-05-07T20:32:05.5192827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5192899Z 2025-05-07T20:32:05.5192997Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5193001Z 2025-05-07T20:32:05.5193098Z moe/activation_test.py:117: 2025-05-07T20:32:05.5193235Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5193343Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5193443Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5193814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5193906Z return fn(*args, **kwargs) 2025-05-07T20:32:05.5194396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5194498Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5194852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5195077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5195421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5195514Z kernel = self.compile( 2025-05-07T20:32:05.5195896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5196066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5196193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5196197Z 2025-05-07T20:32:05.5196406Z self = 2025-05-07T20:32:05.5197173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5197765Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4899e4e020>} 2025-05-07T20:32:05.5198511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5198698Z context = 2025-05-07T20:32:05.5198713Z 2025-05-07T20:32:05.5198876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5199135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5199247Z module_map=module_map) 2025-05-07T20:32:05.5199448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5199585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5199669Z E ^ 2025-05-07T20:32:05.5200030Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5200035Z 2025-05-07T20:32:05.5200457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5200461Z 2025-05-07T20:32:05.5200566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5200788Z self=, 2025-05-07T20:32:05.5200874Z T=1, 2025-05-07T20:32:05.5200952Z D=5120, 2025-05-07T20:32:05.5201035Z scale_ub=1200.0, 2025-05-07T20:32:05.5201129Z contiguous=False, 2025-05-07T20:32:05.5201214Z compiled=False, 2025-05-07T20:32:05.5201289Z ) 2025-05-07T20:32:05.5201509Z self = 2025-05-07T20:32:05.5201681Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5201686Z 2025-05-07T20:32:05.5201772Z @given( 2025-05-07T20:32:05.5201892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5201995Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5202115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5202231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5202342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5202426Z ) 2025-05-07T20:32:05.5202667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5202768Z def test_silu_mul_quant( 2025-05-07T20:32:05.5202844Z self, 2025-05-07T20:32:05.5202920Z T: int, 2025-05-07T20:32:05.5203003Z D: int, 2025-05-07T20:32:05.5203101Z scale_ub: Optional[float], 2025-05-07T20:32:05.5203192Z contiguous: bool, 2025-05-07T20:32:05.5203286Z compiled: bool, 2025-05-07T20:32:05.5203363Z ) -> None: 2025-05-07T20:32:05.5203458Z torch.manual_seed(2025) 2025-05-07T20:32:05.5203536Z 2025-05-07T20:32:05.5203706Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5203781Z 2025-05-07T20:32:05.5203879Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5204005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5204096Z x = x_sign * x_clamp 2025-05-07T20:32:05.5204180Z x0 = x[:, :D] 2025-05-07T20:32:05.5204260Z x1 = x[:, D:] 2025-05-07T20:32:05.5204338Z 2025-05-07T20:32:05.5204421Z if contiguous: 2025-05-07T20:32:05.5204516Z x0 = x0.contiguous() 2025-05-07T20:32:05.5204610Z x1 = x1.contiguous() 2025-05-07T20:32:05.5204680Z 2025-05-07T20:32:05.5204782Z if scale_ub is not None: 2025-05-07T20:32:05.5204890Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5205028Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5205109Z ) 2025-05-07T20:32:05.5205184Z else: 2025-05-07T20:32:05.5205278Z scale_ub_tensor = None 2025-05-07T20:32:05.5205358Z 2025-05-07T20:32:05.5205570Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5205662Z op = silu_mul_quant 2025-05-07T20:32:05.5205754Z if compiled: 2025-05-07T20:32:05.5205854Z op = torch.compile(op) 2025-05-07T20:32:05.5205968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5206037Z 2025-05-07T20:32:05.5206127Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5206132Z 2025-05-07T20:32:05.5206236Z moe/activation_test.py:117: 2025-05-07T20:32:05.5206364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5206465Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5206614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5207146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5207248Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5207607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5207824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5208168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5208262Z kernel = self.compile( 2025-05-07T20:32:05.5208642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5208821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5208951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5208958Z 2025-05-07T20:32:05.5209163Z self = 2025-05-07T20:32:05.5209941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5210436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbc720>} 2025-05-07T20:32:05.5211186Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5211374Z context = 2025-05-07T20:32:05.5211381Z 2025-05-07T20:32:05.5211552Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5211809Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5211920Z module_map=module_map) 2025-05-07T20:32:05.5212088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5212187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5212270Z E ^ 2025-05-07T20:32:05.5212620Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5212625Z 2025-05-07T20:32:05.5213034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5213038Z 2025-05-07T20:32:05.5213145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5213366Z self=, 2025-05-07T20:32:05.5213455Z T=16384, 2025-05-07T20:32:05.5213530Z D=5120, 2025-05-07T20:32:05.5213613Z scale_ub=1200.0, 2025-05-07T20:32:05.5213704Z contiguous=False, 2025-05-07T20:32:05.5213787Z compiled=True, 2025-05-07T20:32:05.5213859Z ) 2025-05-07T20:32:05.5214189Z self = 2025-05-07T20:32:05.5214368Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5214373Z 2025-05-07T20:32:05.5214450Z @given( 2025-05-07T20:32:05.5214575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5214674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5214795Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5214911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5215023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5215101Z ) 2025-05-07T20:32:05.5215384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5215516Z def test_silu_mul_quant( 2025-05-07T20:32:05.5215599Z self, 2025-05-07T20:32:05.5215676Z T: int, 2025-05-07T20:32:05.5215753Z D: int, 2025-05-07T20:32:05.5215863Z scale_ub: Optional[float], 2025-05-07T20:32:05.5215953Z contiguous: bool, 2025-05-07T20:32:05.5216038Z compiled: bool, 2025-05-07T20:32:05.5216120Z ) -> None: 2025-05-07T20:32:05.5216216Z torch.manual_seed(2025) 2025-05-07T20:32:05.5216293Z 2025-05-07T20:32:05.5216461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5216536Z 2025-05-07T20:32:05.5216634Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5216757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5216847Z x = x_sign * x_clamp 2025-05-07T20:32:05.5216935Z x0 = x[:, :D] 2025-05-07T20:32:05.5217017Z x1 = x[:, D:] 2025-05-07T20:32:05.5217092Z 2025-05-07T20:32:05.5217181Z if contiguous: 2025-05-07T20:32:05.5217273Z x0 = x0.contiguous() 2025-05-07T20:32:05.5217364Z x1 = x1.contiguous() 2025-05-07T20:32:05.5217441Z 2025-05-07T20:32:05.5217535Z if scale_ub is not None: 2025-05-07T20:32:05.5217646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5217786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5217860Z ) 2025-05-07T20:32:05.5217944Z else: 2025-05-07T20:32:05.5218037Z scale_ub_tensor = None 2025-05-07T20:32:05.5218109Z 2025-05-07T20:32:05.5218242Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5218331Z op = silu_mul_quant 2025-05-07T20:32:05.5218416Z if compiled: 2025-05-07T20:32:05.5218522Z op = torch.compile(op) 2025-05-07T20:32:05.5218626Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5218704Z 2025-05-07T20:32:05.5218805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5218809Z 2025-05-07T20:32:05.5218906Z moe/activation_test.py:117: 2025-05-07T20:32:05.5219042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5219147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5219247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5219619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5219715Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5220204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5220308Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5220664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5220891Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5221230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5221323Z kernel = self.compile( 2025-05-07T20:32:05.5221798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5221972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5222099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5222110Z 2025-05-07T20:32:05.5222313Z self = 2025-05-07T20:32:05.5223082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5223624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbdd00>} 2025-05-07T20:32:05.5224409Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5224606Z context = 2025-05-07T20:32:05.5224611Z 2025-05-07T20:32:05.5224772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5225032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5225146Z module_map=module_map) 2025-05-07T20:32:05.5225307Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5225417Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5225500Z E ^ 2025-05-07T20:32:05.5225855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5225860Z 2025-05-07T20:32:05.5226281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5226286Z 2025-05-07T20:32:05.5226387Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5226609Z self=, 2025-05-07T20:32:05.5226696Z T=2048, 2025-05-07T20:32:05.5226774Z D=7168, 2025-05-07T20:32:05.5226865Z scale_ub=1200.0, 2025-05-07T20:32:05.5226952Z contiguous=False, 2025-05-07T20:32:05.5227035Z compiled=True, 2025-05-07T20:32:05.5227112Z ) 2025-05-07T20:32:05.5227328Z self = 2025-05-07T20:32:05.5227500Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5227509Z 2025-05-07T20:32:05.5227592Z @given( 2025-05-07T20:32:05.5227711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5227809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5227930Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5228047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5228432Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5228546Z ) 2025-05-07T20:32:05.5228879Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5228979Z def test_silu_mul_quant( 2025-05-07T20:32:05.5229116Z self, 2025-05-07T20:32:05.5229198Z T: int, 2025-05-07T20:32:05.5229279Z D: int, 2025-05-07T20:32:05.5229375Z scale_ub: Optional[float], 2025-05-07T20:32:05.5229463Z contiguous: bool, 2025-05-07T20:32:05.5229554Z compiled: bool, 2025-05-07T20:32:05.5229635Z ) -> None: 2025-05-07T20:32:05.5229732Z torch.manual_seed(2025) 2025-05-07T20:32:05.5229810Z 2025-05-07T20:32:05.5229976Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5230055Z 2025-05-07T20:32:05.5230148Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5230514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5230613Z x = x_sign * x_clamp 2025-05-07T20:32:05.5230693Z x0 = x[:, :D] 2025-05-07T20:32:05.5230772Z x1 = x[:, D:] 2025-05-07T20:32:05.5230850Z 2025-05-07T20:32:05.5230933Z if contiguous: 2025-05-07T20:32:05.5231024Z x0 = x0.contiguous() 2025-05-07T20:32:05.5231121Z x1 = x1.contiguous() 2025-05-07T20:32:05.5231191Z 2025-05-07T20:32:05.5231281Z if scale_ub is not None: 2025-05-07T20:32:05.5231392Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5231526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5231731Z ) 2025-05-07T20:32:05.5231807Z else: 2025-05-07T20:32:05.5231900Z scale_ub_tensor = None 2025-05-07T20:32:05.5231981Z 2025-05-07T20:32:05.5232111Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5232206Z op = silu_mul_quant 2025-05-07T20:32:05.5232300Z if compiled: 2025-05-07T20:32:05.5232401Z op = torch.compile(op) 2025-05-07T20:32:05.5232506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5232583Z 2025-05-07T20:32:05.5232673Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5232678Z 2025-05-07T20:32:05.5232775Z moe/activation_test.py:117: 2025-05-07T20:32:05.5232909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5233009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5233114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5233480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5233578Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898fbe840>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48991640e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
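Note on the failure mode: Triton's fp8e4nv is the e4m3 format that torch.float8_e4m3fn lowers to, and Triton only compiles fp8e4nv casts on NVIDIA GPUs with compute capability 8.9 (Ada) or 9.0 (Hopper) and newer; on older architectures it exposes only fp8e4b15 and fp8e5, which is exactly what the ValueError above reports. The failure is therefore environmental, not input-dependent, so every Hypothesis draw hits the same CompilationError regardless of T, D, scale_ub, contiguous, or compiled. A capability guard along these lines could skip the test on unsupported hardware (a sketch only; device_supports_fp8e4nv is a made-up helper, not part of activation_test.py):

    import unittest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) casts compile in Triton only on GPUs
        # with compute capability >= 8.9 (Ada, Hopper, and newer).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Applied to the test above, e.g.:
    # @unittest.skipUnless(device_supports_fp8e4nv(), "GPU lacks fp8e4nv support")

With such a guard the run would report one skip instead of a wall of identical CompilationErrors.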
Hypothesis went on to try eleven more examples; every draw failed at the same _fbgemm_silu_mul_quant compile with the identical CompilationError, only the sampled parameters differing:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -- same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -- same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -- failing as before:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5407707Z 2025-05-07T20:32:05.5408125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5408132Z 2025-05-07T20:32:05.5408246Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5408473Z self=, 2025-05-07T20:32:05.5408553Z T=16384, 2025-05-07T20:32:05.5408639Z D=5120, 2025-05-07T20:32:05.5408725Z scale_ub=1200.0, 2025-05-07T20:32:05.5408822Z contiguous=False, 2025-05-07T20:32:05.5408909Z compiled=False, 2025-05-07T20:32:05.5408984Z ) 2025-05-07T20:32:05.5409209Z self = 2025-05-07T20:32:05.5409390Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5409394Z 2025-05-07T20:32:05.5409473Z @given( 2025-05-07T20:32:05.5409601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5409700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5409819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5409944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5410058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5410145Z ) 2025-05-07T20:32:05.5410395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5410489Z def test_silu_mul_quant( 2025-05-07T20:32:05.5410573Z self, 2025-05-07T20:32:05.5410651Z T: int, 2025-05-07T20:32:05.5410728Z D: int, 2025-05-07T20:32:05.5410835Z scale_ub: Optional[float], 2025-05-07T20:32:05.5410925Z contiguous: bool, 2025-05-07T20:32:05.5411012Z compiled: bool, 2025-05-07T20:32:05.5411098Z ) -> None: 2025-05-07T20:32:05.5411194Z torch.manual_seed(2025) 2025-05-07T20:32:05.5411267Z 2025-05-07T20:32:05.5411443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5411520Z 2025-05-07T20:32:05.5411622Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5411747Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5411836Z x = x_sign * x_clamp 2025-05-07T20:32:05.5411925Z x0 = x[:, :D] 2025-05-07T20:32:05.5412007Z x1 = x[:, D:] 2025-05-07T20:32:05.5412170Z 2025-05-07T20:32:05.5412263Z if contiguous: 2025-05-07T20:32:05.5412357Z x0 = x0.contiguous() 2025-05-07T20:32:05.5412449Z x1 = x1.contiguous() 2025-05-07T20:32:05.5412534Z 2025-05-07T20:32:05.5412627Z if scale_ub is not None: 2025-05-07T20:32:05.5412735Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5412882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5412958Z ) 2025-05-07T20:32:05.5413036Z else: 2025-05-07T20:32:05.5413139Z scale_ub_tensor = None 2025-05-07T20:32:05.5413216Z 2025-05-07T20:32:05.5413359Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5413527Z op = silu_mul_quant 2025-05-07T20:32:05.5413615Z if compiled: 2025-05-07T20:32:05.5413725Z op = torch.compile(op) 2025-05-07T20:32:05.5413831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5413913Z 2025-05-07T20:32:05.5414013Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5414017Z 2025-05-07T20:32:05.5414116Z moe/activation_test.py:117: 2025-05-07T20:32:05.5414248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5414365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5414465Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5414973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:05.5415072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5415429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5415665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5416004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5416103Z kernel = self.compile( 2025-05-07T20:32:05.5416493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5416666Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5416808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5416812Z 2025-05-07T20:32:05.5417015Z self = 2025-05-07T20:32:05.5417787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5418297Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b24c20>} 2025-05-07T20:32:05.5419045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5419244Z context = 2025-05-07T20:32:05.5419249Z 2025-05-07T20:32:05.5419416Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5419689Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5419798Z module_map=module_map) 2025-05-07T20:32:05.5419962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5420074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5420154Z E ^ 2025-05-07T20:32:05.5420508Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5420596Z 2025-05-07T20:32:05.5421020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5421024Z 2025-05-07T20:32:05.5421129Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5421361Z self=, 2025-05-07T20:32:05.5421440Z T=16384, 2025-05-07T20:32:05.5421518Z D=5120, 2025-05-07T20:32:05.5421610Z scale_ub=1200.0, 2025-05-07T20:32:05.5421701Z contiguous=True, 2025-05-07T20:32:05.5421785Z compiled=True, 2025-05-07T20:32:05.5421866Z ) 2025-05-07T20:32:05.5422085Z self = 2025-05-07T20:32:05.5422337Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5422351Z 2025-05-07T20:32:05.5422429Z @given( 2025-05-07T20:32:05.5422548Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5422665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5422780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5422898Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5423018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5423094Z ) 2025-05-07T20:32:05.5423338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5423443Z def test_silu_mul_quant( 2025-05-07T20:32:05.5423522Z self, 2025-05-07T20:32:05.5423600Z T: int, 2025-05-07T20:32:05.5423687Z D: int, 2025-05-07T20:32:05.5423789Z scale_ub: Optional[float], 2025-05-07T20:32:05.5423891Z contiguous: bool, 2025-05-07T20:32:05.5423985Z compiled: bool, 2025-05-07T20:32:05.5424065Z ) -> None: 2025-05-07T20:32:05.5424166Z torch.manual_seed(2025) 2025-05-07T20:32:05.5424239Z 2025-05-07T20:32:05.5424414Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5424495Z 2025-05-07T20:32:05.5424590Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5424714Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5424811Z x = x_sign * x_clamp 2025-05-07T20:32:05.5424892Z x0 = x[:, :D] 2025-05-07T20:32:05.5424973Z x1 = x[:, D:] 2025-05-07T20:32:05.5425055Z 2025-05-07T20:32:05.5425141Z if contiguous: 2025-05-07T20:32:05.5425241Z x0 = x0.contiguous() 2025-05-07T20:32:05.5425330Z x1 = x1.contiguous() 2025-05-07T20:32:05.5425403Z 2025-05-07T20:32:05.5425502Z if scale_ub is not None: 2025-05-07T20:32:05.5425611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5425750Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5425834Z ) 2025-05-07T20:32:05.5425912Z else: 2025-05-07T20:32:05.5426008Z scale_ub_tensor = None 2025-05-07T20:32:05.5426092Z 2025-05-07T20:32:05.5426226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5426318Z op = silu_mul_quant 2025-05-07T20:32:05.5426412Z if compiled: 2025-05-07T20:32:05.5426512Z op = torch.compile(op) 2025-05-07T20:32:05.5426627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5426702Z 2025-05-07T20:32:05.5426794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5426798Z 2025-05-07T20:32:05.5426908Z moe/activation_test.py:117: 2025-05-07T20:32:05.5427040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5427143Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5427256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5427627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5427722Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5428882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5429009Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5429428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5429652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5429991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5430096Z kernel = self.compile( 2025-05-07T20:32:05.5430478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5430803Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5430933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5430938Z 2025-05-07T20:32:05.5431150Z self = 2025-05-07T20:32:05.5431927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5432424Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b260c0>} 2025-05-07T20:32:05.5433171Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5433366Z context = 2025-05-07T20:32:05.5433371Z 2025-05-07T20:32:05.5433534Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5433807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5433919Z module_map=module_map) 2025-05-07T20:32:05.5434088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5434189Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5434270Z E ^ 2025-05-07T20:32:05.5434632Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5434637Z 2025-05-07T20:32:05.5435053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5435063Z 2025-05-07T20:32:05.5435174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5435397Z self=, 2025-05-07T20:32:05.5435476Z T=16384, 2025-05-07T20:32:05.5435566Z D=5120, 2025-05-07T20:32:05.5435654Z scale_ub=None, 2025-05-07T20:32:05.5435743Z contiguous=False, 2025-05-07T20:32:05.5435840Z compiled=True, 2025-05-07T20:32:05.5435914Z ) 2025-05-07T20:32:05.5436130Z self = 2025-05-07T20:32:05.5436316Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5436321Z 2025-05-07T20:32:05.5436399Z @given( 2025-05-07T20:32:05.5436526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5436629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5436744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5436873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5436993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5437067Z ) 2025-05-07T20:32:05.5437321Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5437547Z def test_silu_mul_quant( 2025-05-07T20:32:05.5437631Z self, 2025-05-07T20:32:05.5437709Z T: int, 2025-05-07T20:32:05.5437785Z D: int, 2025-05-07T20:32:05.5437889Z scale_ub: Optional[float], 2025-05-07T20:32:05.5437980Z contiguous: bool, 2025-05-07T20:32:05.5438065Z compiled: bool, 2025-05-07T20:32:05.5438150Z ) -> None: 2025-05-07T20:32:05.5438245Z torch.manual_seed(2025) 2025-05-07T20:32:05.5438318Z 2025-05-07T20:32:05.5438492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5438566Z 2025-05-07T20:32:05.5438665Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5438788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5438956Z x = x_sign * x_clamp 2025-05-07T20:32:05.5439043Z x0 = x[:, :D] 2025-05-07T20:32:05.5439123Z x1 = x[:, D:] 2025-05-07T20:32:05.5439195Z 2025-05-07T20:32:05.5439283Z if contiguous: 2025-05-07T20:32:05.5439379Z x0 = x0.contiguous() 2025-05-07T20:32:05.5439470Z x1 = x1.contiguous() 2025-05-07T20:32:05.5439549Z 2025-05-07T20:32:05.5439641Z if scale_ub is not None: 2025-05-07T20:32:05.5439747Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5439888Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5439964Z ) 2025-05-07T20:32:05.5440048Z else: 2025-05-07T20:32:05.5440140Z scale_ub_tensor = None 2025-05-07T20:32:05.5440212Z 2025-05-07T20:32:05.5440345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5440434Z op = silu_mul_quant 2025-05-07T20:32:05.5440522Z if compiled: 2025-05-07T20:32:05.5440634Z op = torch.compile(op) 2025-05-07T20:32:05.5440738Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5440810Z 2025-05-07T20:32:05.5440907Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5440911Z 2025-05-07T20:32:05.5441014Z moe/activation_test.py:117: 2025-05-07T20:32:05.5441141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5441247Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5441347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5441718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5441811Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5442299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5442400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5442756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5442985Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5443324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5443418Z kernel = self.compile( 2025-05-07T20:32:05.5443801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5443976Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5444104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5444108Z 2025-05-07T20:32:05.5444317Z self = 2025-05-07T20:32:05.5445087Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5445689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f4898b26c00>} 2025-05-07T20:32:05.5446432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5446625Z context = 2025-05-07T20:32:05.5446630Z 2025-05-07T20:32:05.5446791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5447049Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5447201Z module_map=module_map) 2025-05-07T20:32:05.5447400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5447497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5447579Z E ^ 2025-05-07T20:32:05.5447935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5447940Z 2025-05-07T20:32:05.5448356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5448361Z 2025-05-07T20:32:05.5448463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5448682Z self=, 2025-05-07T20:32:05.5448769Z T=2048, 2025-05-07T20:32:05.5448846Z D=5120, 2025-05-07T20:32:05.5448930Z scale_ub=None, 2025-05-07T20:32:05.5449026Z contiguous=False, 2025-05-07T20:32:05.5449108Z compiled=True, 2025-05-07T20:32:05.5449190Z ) 2025-05-07T20:32:05.5449406Z self = 2025-05-07T20:32:05.5449579Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5449584Z 2025-05-07T20:32:05.5449667Z @given( 2025-05-07T20:32:05.5449787Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5449886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5450008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5450124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5450241Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5450314Z ) 2025-05-07T20:32:05.5450557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5450656Z def test_silu_mul_quant( 2025-05-07T20:32:05.5450733Z self, 2025-05-07T20:32:05.5450810Z T: int, 2025-05-07T20:32:05.5450892Z D: int, 2025-05-07T20:32:05.5450993Z scale_ub: Optional[float], 2025-05-07T20:32:05.5451085Z contiguous: bool, 2025-05-07T20:32:05.5451179Z compiled: bool, 2025-05-07T20:32:05.5451258Z ) -> None: 2025-05-07T20:32:05.5451353Z torch.manual_seed(2025) 2025-05-07T20:32:05.5451432Z 2025-05-07T20:32:05.5451604Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5451677Z 2025-05-07T20:32:05.5451775Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5451898Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5451995Z x = x_sign * x_clamp 2025-05-07T20:32:05.5452077Z x0 = x[:, :D] 2025-05-07T20:32:05.5452160Z x1 = x[:, D:] 2025-05-07T20:32:05.5452240Z 2025-05-07T20:32:05.5452324Z if contiguous: 2025-05-07T20:32:05.5452420Z x0 = x0.contiguous() 2025-05-07T20:32:05.5452515Z x1 = x1.contiguous() 2025-05-07T20:32:05.5452587Z 2025-05-07T20:32:05.5452680Z if scale_ub is not None: 2025-05-07T20:32:05.5452798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5452933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5453009Z ) 2025-05-07T20:32:05.5453091Z else: 2025-05-07T20:32:05.5453343Z scale_ub_tensor = None 2025-05-07T20:32:05.5453427Z 2025-05-07T20:32:05.5453555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5453647Z op = silu_mul_quant 2025-05-07T20:32:05.5453740Z if compiled: 2025-05-07T20:32:05.5453840Z op = torch.compile(op) 2025-05-07T20:32:05.5453946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5454024Z 2025-05-07T20:32:05.5454113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5454118Z 2025-05-07T20:32:05.5454213Z moe/activation_test.py:117: 2025-05-07T20:32:05.5454345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5454489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5454633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5455003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5455094Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5455594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5455691Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5456045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5456271Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5456606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5456708Z kernel = self.compile( 2025-05-07T20:32:05.5457089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5457262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5457396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5457405Z 2025-05-07T20:32:05.5457605Z self = 2025-05-07T20:32:05.5458395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5458892Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881c680>} 2025-05-07T20:32:05.5459641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5459833Z context = 2025-05-07T20:32:05.5459837Z 2025-05-07T20:32:05.5460013Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5460271Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5460378Z module_map=module_map) 2025-05-07T20:32:05.5460545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5460644Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5460721Z E ^ 2025-05-07T20:32:05.5461079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5461084Z 2025-05-07T20:32:05.5461498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5461507Z 2025-05-07T20:32:05.5461615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5461838Z self=, 2025-05-07T20:32:05.5461916Z T=2048, 2025-05-07T20:32:05.5462082Z D=5120, 2025-05-07T20:32:05.5462167Z scale_ub=1200.0, 2025-05-07T20:32:05.5462254Z contiguous=False, 2025-05-07T20:32:05.5462342Z compiled=True, 2025-05-07T20:32:05.5462415Z ) 2025-05-07T20:32:05.5462630Z self = 2025-05-07T20:32:05.5462809Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5462814Z 2025-05-07T20:32:05.5462891Z @given( 2025-05-07T20:32:05.5463015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5463114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5463228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5463432Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5463547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5463621Z ) 2025-05-07T20:32:05.5463876Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5463970Z def test_silu_mul_quant( 2025-05-07T20:32:05.5464055Z self, 2025-05-07T20:32:05.5464132Z T: int, 2025-05-07T20:32:05.5464207Z D: int, 2025-05-07T20:32:05.5464312Z scale_ub: Optional[float], 2025-05-07T20:32:05.5464401Z contiguous: bool, 2025-05-07T20:32:05.5464487Z compiled: bool, 2025-05-07T20:32:05.5464572Z ) -> None: 2025-05-07T20:32:05.5464665Z torch.manual_seed(2025) 2025-05-07T20:32:05.5464736Z 2025-05-07T20:32:05.5464908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5464980Z 2025-05-07T20:32:05.5465077Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5465209Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5465295Z x = x_sign * x_clamp 2025-05-07T20:32:05.5465375Z x0 = x[:, :D] 2025-05-07T20:32:05.5465461Z x1 = x[:, D:] 2025-05-07T20:32:05.5465533Z 2025-05-07T20:32:05.5465627Z if contiguous: 2025-05-07T20:32:05.5465717Z x0 = x0.contiguous() 2025-05-07T20:32:05.5465805Z x1 = x1.contiguous() 2025-05-07T20:32:05.5465881Z 2025-05-07T20:32:05.5465972Z if scale_ub is not None: 2025-05-07T20:32:05.5466076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5466218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5466293Z ) 2025-05-07T20:32:05.5466367Z else: 2025-05-07T20:32:05.5466466Z scale_ub_tensor = None 2025-05-07T20:32:05.5466537Z 2025-05-07T20:32:05.5466664Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5466763Z op = silu_mul_quant 2025-05-07T20:32:05.5466851Z if compiled: 2025-05-07T20:32:05.5466957Z op = torch.compile(op) 2025-05-07T20:32:05.5467062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5467133Z 2025-05-07T20:32:05.5467233Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5467238Z 2025-05-07T20:32:05.5467336Z moe/activation_test.py:117: 2025-05-07T20:32:05.5467464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5467570Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5467670Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5468035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5468132Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5468623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5468728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5469175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5469398Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5469866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5469960Z kernel = self.compile( 2025-05-07T20:32:05.5470343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5470513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5470641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5470645Z 2025-05-07T20:32:05.5470854Z self = 2025-05-07T20:32:05.5471664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5472220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881d1c0>} 2025-05-07T20:32:05.5472961Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5473149Z context = 2025-05-07T20:32:05.5473154Z 2025-05-07T20:32:05.5473327Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5473585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5473702Z module_map=module_map) 2025-05-07T20:32:05.5473863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5473960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5474044Z E ^ 2025-05-07T20:32:05.5474400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5474405Z 2025-05-07T20:32:05.5474816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5474827Z 2025-05-07T20:32:05.5474930Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5475152Z self=, 2025-05-07T20:32:05.5475236Z T=4096, 2025-05-07T20:32:05.5475312Z D=5120, 2025-05-07T20:32:05.5475397Z scale_ub=1200.0, 2025-05-07T20:32:05.5475489Z contiguous=True, 2025-05-07T20:32:05.5475574Z compiled=True, 2025-05-07T20:32:05.5475650Z ) 2025-05-07T20:32:05.5475872Z self = 2025-05-07T20:32:05.5476041Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5476045Z 2025-05-07T20:32:05.5476135Z @given( 2025-05-07T20:32:05.5476253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5476351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5476472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5476588Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5476702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5476782Z ) 2025-05-07T20:32:05.5477025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5477118Z def test_silu_mul_quant( 2025-05-07T20:32:05.5477202Z self, 2025-05-07T20:32:05.5477282Z T: int, 2025-05-07T20:32:05.5477361Z D: int, 2025-05-07T20:32:05.5477466Z scale_ub: Optional[float], 2025-05-07T20:32:05.5477555Z contiguous: bool, 2025-05-07T20:32:05.5477648Z compiled: bool, 2025-05-07T20:32:05.5477729Z ) -> None: 2025-05-07T20:32:05.5477910Z torch.manual_seed(2025) 2025-05-07T20:32:05.5477991Z 2025-05-07T20:32:05.5478159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5478232Z 2025-05-07T20:32:05.5478329Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5478454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5478543Z x = x_sign * x_clamp 2025-05-07T20:32:05.5478632Z x0 = x[:, :D] 2025-05-07T20:32:05.5478712Z x1 = x[:, D:] 2025-05-07T20:32:05.5478783Z 2025-05-07T20:32:05.5478873Z if contiguous: 2025-05-07T20:32:05.5478963Z x0 = x0.contiguous() 2025-05-07T20:32:05.5479055Z x1 = x1.contiguous() 2025-05-07T20:32:05.5479171Z 2025-05-07T20:32:05.5479305Z if scale_ub is not None: 2025-05-07T20:32:05.5479416Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5479552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5479628Z ) 2025-05-07T20:32:05.5479716Z else: 2025-05-07T20:32:05.5479809Z scale_ub_tensor = None 2025-05-07T20:32:05.5479880Z 2025-05-07T20:32:05.5480014Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5480104Z op = silu_mul_quant 2025-05-07T20:32:05.5480188Z if compiled: 2025-05-07T20:32:05.5480293Z op = torch.compile(op) 2025-05-07T20:32:05.5480398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5480476Z 2025-05-07T20:32:05.5480565Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5480570Z 2025-05-07T20:32:05.5480666Z moe/activation_test.py:117: 2025-05-07T20:32:05.5480802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5480908Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5481007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5481378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5481474Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5481973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5482069Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5482422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5482647Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5482981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5483077Z kernel = self.compile( 2025-05-07T20:32:05.5483464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5483636Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5483774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5483779Z 2025-05-07T20:32:05.5483980Z self = 2025-05-07T20:32:05.5484752Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5485252Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881da80>} 2025-05-07T20:32:05.5485994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5486191Z context = 2025-05-07T20:32:05.5486282Z 2025-05-07T20:32:05.5486449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5486708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5486824Z module_map=module_map) 2025-05-07T20:32:05.5486982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5487090Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5487167Z E ^ 2025-05-07T20:32:05.5487519Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5487562Z 2025-05-07T20:32:05.5487985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5488028Z 2025-05-07T20:32:05.5488132Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5488368Z self=, 2025-05-07T20:32:05.5488446Z T=128, 2025-05-07T20:32:05.5488522Z D=5120, 2025-05-07T20:32:05.5488610Z scale_ub=1200.0, 2025-05-07T20:32:05.5488697Z contiguous=False, 2025-05-07T20:32:05.5488783Z compiled=True, 2025-05-07T20:32:05.5488860Z ) 2025-05-07T20:32:05.5489076Z self = 2025-05-07T20:32:05.5489244Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:05.5489248Z 2025-05-07T20:32:05.5489333Z @given( 2025-05-07T20:32:05.5489451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5489558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5489677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5489793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5489912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5489985Z ) 2025-05-07T20:32:05.5490233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5490336Z def test_silu_mul_quant( 2025-05-07T20:32:05.5490412Z self, 2025-05-07T20:32:05.5490487Z T: int, 2025-05-07T20:32:05.5490570Z D: int, 2025-05-07T20:32:05.5490667Z scale_ub: Optional[float], 2025-05-07T20:32:05.5490755Z contiguous: bool, 2025-05-07T20:32:05.5490847Z compiled: bool, 2025-05-07T20:32:05.5490924Z ) -> None: 2025-05-07T20:32:05.5491025Z torch.manual_seed(2025) 2025-05-07T20:32:05.5491098Z 2025-05-07T20:32:05.5491266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5491347Z 2025-05-07T20:32:05.5491441Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5491565Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5491661Z x = x_sign * x_clamp 2025-05-07T20:32:05.5491741Z x0 = x[:, :D] 2025-05-07T20:32:05.5491828Z x1 = x[:, D:] 2025-05-07T20:32:05.5491907Z 2025-05-07T20:32:05.5491992Z if contiguous: 2025-05-07T20:32:05.5492083Z x0 = x0.contiguous() 2025-05-07T20:32:05.5492178Z x1 = x1.contiguous() 2025-05-07T20:32:05.5492249Z 2025-05-07T20:32:05.5492349Z if scale_ub is not None: 2025-05-07T20:32:05.5492455Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5492589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5492671Z ) 2025-05-07T20:32:05.5492748Z else: 2025-05-07T20:32:05.5492842Z scale_ub_tensor = None 2025-05-07T20:32:05.5492919Z 2025-05-07T20:32:05.5493048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5493140Z op = silu_mul_quant 2025-05-07T20:32:05.5493229Z if compiled: 2025-05-07T20:32:05.5493327Z op = torch.compile(op) 2025-05-07T20:32:05.5493431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5493603Z 2025-05-07T20:32:05.5493696Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5493700Z 2025-05-07T20:32:05.5493803Z moe/activation_test.py:117: 2025-05-07T20:32:05.5493931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5494031Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5494136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5494500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5494592Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5495086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5495297Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5499969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5500248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5500592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5500691Z kernel = self.compile( 2025-05-07T20:32:05.5501080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5501253Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5501387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5501392Z 2025-05-07T20:32:05.5501595Z self = 2025-05-07T20:32:05.5502400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5502910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489881fa60>} 2025-05-07T20:32:05.5503651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5503849Z context = 2025-05-07T20:32:05.5503854Z 2025-05-07T20:32:05.5504017Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5504286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5504396Z module_map=module_map) 2025-05-07T20:32:05.5504559Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5504666Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5504746Z E ^ 2025-05-07T20:32:05.5505103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5505108Z 2025-05-07T20:32:05.5505527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5505532Z 2025-05-07T20:32:05.5505635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5505863Z self=, 2025-05-07T20:32:05.5505943Z T=16384, 2025-05-07T20:32:05.5506020Z D=7168, 2025-05-07T20:32:05.5506114Z scale_ub=1200.0, 2025-05-07T20:32:05.5506203Z contiguous=True, 2025-05-07T20:32:05.5506287Z compiled=True, 2025-05-07T20:32:05.5506366Z ) 2025-05-07T20:32:05.5506583Z self = 2025-05-07T20:32:05.5506835Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5506847Z 2025-05-07T20:32:05.5506926Z @given( 2025-05-07T20:32:05.5507043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5507151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5507266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5507382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5507503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5507578Z ) 2025-05-07T20:32:05.5507822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5507926Z def test_silu_mul_quant( 2025-05-07T20:32:05.5508066Z self, 2025-05-07T20:32:05.5508185Z T: int, 2025-05-07T20:32:05.5508269Z D: int, 2025-05-07T20:32:05.5508367Z scale_ub: Optional[float], 2025-05-07T20:32:05.5508463Z contiguous: bool, 2025-05-07T20:32:05.5508648Z compiled: bool, 2025-05-07T20:32:05.5508734Z ) -> None: 2025-05-07T20:32:05.5508836Z torch.manual_seed(2025) 2025-05-07T20:32:05.5508911Z 2025-05-07T20:32:05.5509169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5509251Z 2025-05-07T20:32:05.5509344Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5509469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5509566Z x = x_sign * x_clamp 2025-05-07T20:32:05.5509648Z x0 = x[:, :D] 2025-05-07T20:32:05.5509729Z x1 = x[:, D:] 2025-05-07T20:32:05.5509810Z 2025-05-07T20:32:05.5509896Z if contiguous: 2025-05-07T20:32:05.5509989Z x0 = x0.contiguous() 2025-05-07T20:32:05.5510093Z x1 = x1.contiguous() 2025-05-07T20:32:05.5510169Z 2025-05-07T20:32:05.5510266Z if scale_ub is not None: 2025-05-07T20:32:05.5510373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5510514Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5510596Z ) 2025-05-07T20:32:05.5510672Z else: 2025-05-07T20:32:05.5510768Z scale_ub_tensor = None 2025-05-07T20:32:05.5510849Z 2025-05-07T20:32:05.5510976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5511067Z op = silu_mul_quant 2025-05-07T20:32:05.5511159Z if compiled: 2025-05-07T20:32:05.5511260Z op = torch.compile(op) 2025-05-07T20:32:05.5511364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5511444Z 2025-05-07T20:32:05.5511535Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5511539Z 2025-05-07T20:32:05.5511644Z moe/activation_test.py:117: 2025-05-07T20:32:05.5511779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5511880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5511986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5512362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5512475Z return fn(*args, **kwargs) 
2025-05-07T20:32:05.5512994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5513091Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5513450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5513671Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5514006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5514112Z kernel = self.compile( 2025-05-07T20:32:05.5514490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5514711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5514848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5514852Z 2025-05-07T20:32:05.5515054Z self = 2025-05-07T20:32:05.5515828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5516324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489875cd60>} 2025-05-07T20:32:05.5517213Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5517406Z context = 2025-05-07T20:32:05.5517411Z 2025-05-07T20:32:05.5517574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5517838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5517945Z module_map=module_map) 2025-05-07T20:32:05.5518113Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5518211Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5518289Z E ^ 2025-05-07T20:32:05.5518648Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5518658Z 2025-05-07T20:32:05.5519068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5519072Z 2025-05-07T20:32:05.5519184Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5519407Z self=, 2025-05-07T20:32:05.5519486Z T=16384, 2025-05-07T20:32:05.5519569Z D=5120, 2025-05-07T20:32:05.5519652Z scale_ub=1200.0, 2025-05-07T20:32:05.5519740Z contiguous=True, 2025-05-07T20:32:05.5519834Z compiled=False, 2025-05-07T20:32:05.5519908Z ) 2025-05-07T20:32:05.5520125Z self = 2025-05-07T20:32:05.5520307Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5520312Z 2025-05-07T20:32:05.5520389Z @given( 2025-05-07T20:32:05.5520506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5520617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5520731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5520853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5520970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5521044Z ) 2025-05-07T20:32:05.5521292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5521387Z def test_silu_mul_quant( 2025-05-07T20:32:05.5521464Z self, 2025-05-07T20:32:05.5521549Z T: int, 2025-05-07T20:32:05.5521626Z D: int, 2025-05-07T20:32:05.5521724Z scale_ub: Optional[float], 2025-05-07T20:32:05.5521819Z contiguous: bool, 2025-05-07T20:32:05.5521906Z compiled: bool, 2025-05-07T20:32:05.5521991Z ) -> None: 2025-05-07T20:32:05.5522086Z torch.manual_seed(2025) 2025-05-07T20:32:05.5522159Z 2025-05-07T20:32:05.5522336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5522412Z 2025-05-07T20:32:05.5527278Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5527432Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5527530Z x = x_sign * x_clamp 2025-05-07T20:32:05.5527702Z x0 = x[:, :D] 2025-05-07T20:32:05.5527785Z x1 = x[:, D:] 2025-05-07T20:32:05.5527869Z 2025-05-07T20:32:05.5527955Z if contiguous: 2025-05-07T20:32:05.5528049Z x0 = x0.contiguous() 2025-05-07T20:32:05.5528444Z x1 = x1.contiguous() 2025-05-07T20:32:05.5528557Z 2025-05-07T20:32:05.5528692Z if scale_ub is not None: 2025-05-07T20:32:05.5528849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5529037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5529124Z ) 2025-05-07T20:32:05.5529213Z else: 2025-05-07T20:32:05.5529312Z scale_ub_tensor = None 2025-05-07T20:32:05.5529520Z 2025-05-07T20:32:05.5529734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5529827Z op = silu_mul_quant 2025-05-07T20:32:05.5529923Z if compiled: 2025-05-07T20:32:05.5530093Z op = torch.compile(op) 2025-05-07T20:32:05.5530207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5530291Z 2025-05-07T20:32:05.5530385Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5530390Z 2025-05-07T20:32:05.5530492Z moe/activation_test.py:117: 2025-05-07T20:32:05.5530634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5530737Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5530839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5531350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
(Per-line timestamps are omitted below; <...> marks an object repr that the log capture stripped.)

    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(Triton JIT traceback identical to the one above; fails with the same CompilationError at compiler.py:100.)
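Every CompilationError in this run has the same root cause: Triton cannot lower the fp8e4nv (e4m3) type on the GPU this runner exposes, which only supports fp8e4b15 and fp8e5. A minimal sketch of a capability gate one could add to the test module follows; the (8, 9) threshold is an assumption (fp8e4nv codegen is generally tied to Ada/Hopper-class parts), not something this log confirms, and the class name is hypothetical:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (e4m3) codegen needs an NVIDIA GPU
        # with compute capability >= 8.9; older architectures expose only
        # fp8e4b15 / fp8e5, which matches the ValueError in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical gate for the failing test class:
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires sm_89+")
    class ActivationTestsFP8(unittest.TestCase):
        pass

Skipping on pre-sm_89 runners would keep the job green here while still exercising the fp8 kernels on hardware that actually supports fp8e4nv.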
The next three examples fail with the same fp8e4nv CompilationError; their repeated test source and Triton tracebacks are collapsed here:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fails in fn() via torch/_dynamo/eval_frame.py:678 and activation.py:80: in silu_mul_quant
    -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> same CompilationError

The first out-of-memory failure follows:

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
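These out-of-memory failures are cumulative: each Hypothesis example allocates several (T, 2*D) bfloat16 tensors (at T=16384, D=7168 a single one is 16384 x 14336 x 2 bytes = 448 MiB), and the allocator's cached blocks from earlier examples are never returned, so the 22.07 GiB card fills up over the run. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a hypothetical sketch of that plus an explicit cache flush between examples (helper name is an assumption, not FBGEMM code):

    import gc
    import os

    import torch

    # Assumption: effective only if set before the first CUDA allocation in
    # the process; in CI it is safer to export this in the job environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


    def free_cached_blocks() -> None:
        """Return the CUDA caching allocator's blocks to the driver.

        Hypothesis reruns a @given test body many times inside a single
        setUp/tearDown pair, so calling this at the top of the test body
        (rather than in tearDown) frees one example's bf16 activations
        before the next example allocates its own.
        """
        gc.collect()
        torch.cuda.empty_cache()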
The remaining examples alternate between the same two failure modes; they are condensed below to their unique parameters and allocation figures:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: tried to allocate 112.00 MiB (28.44 MiB free; 21.61 GiB allocated by PyTorch, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 448.00 MiB (140.44 MiB free; 21.50 GiB allocated, 141.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> moe/activation_test.py:95 (x_clamp): torch.OutOfMemoryError: tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:94 (x_sign): torch.OutOfMemoryError: tried to allocate 56.00 MiB (28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    -> activation.py:80 (_fbgemm_silu_mul_quant[grid]): same fp8e4nv CompilationError as above

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
    -> same CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 56.00 MiB (26.44 MiB free; 21.69 GiB allocated, 59.18 MiB reserved but unallocated)
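When triaging a run like this, a specific failing draw can be made to replay deterministically with Hypothesis's @example decorator. A minimal sketch, with the strategies copied from the test above; the function name is hypothetical and the body (elided) would be the one shown in this log:

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import example, given, settings


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the first failing draw from this log so it replays on every run:
    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None, max_examples=10)
    def test_silu_mul_quant_repro(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # test body as in the log above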
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> activation.py:80 (_fbgemm_silu_mul_quant[grid]): same fp8e4nv CompilationError as above

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:94 (x_sign): torch.OutOfMemoryError: tried to allocate 40.00 MiB (26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> moe/activation_test.py:92 (torch.randn): torch.OutOfMemoryError: tried to allocate 320.00 MiB (26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False

        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5722417Z 2025-05-07T20:32:05.5722538Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5722542Z 2025-05-07T20:32:05.5722649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5722877Z self=, 2025-05-07T20:32:05.5722957Z T=2048, 2025-05-07T20:32:05.5723036Z D=5120, 2025-05-07T20:32:05.5723128Z scale_ub=None, 2025-05-07T20:32:05.5723219Z contiguous=False, 2025-05-07T20:32:05.5723305Z compiled=False, 2025-05-07T20:32:05.5723391Z ) 2025-05-07T20:32:05.5723609Z self = 2025-05-07T20:32:05.5723788Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.5723797Z 2025-05-07T20:32:05.5723878Z @given( 2025-05-07T20:32:05.5723995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5724103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5724262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5724380Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5724499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5724574Z ) 2025-05-07T20:32:05.5724823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5724919Z def test_silu_mul_quant( 2025-05-07T20:32:05.5724998Z self, 2025-05-07T20:32:05.5725084Z T: int, 2025-05-07T20:32:05.5725163Z D: int, 2025-05-07T20:32:05.5725262Z scale_ub: Optional[float], 2025-05-07T20:32:05.5725359Z contiguous: bool, 2025-05-07T20:32:05.5725448Z compiled: bool, 2025-05-07T20:32:05.5725581Z ) -> None: 2025-05-07T20:32:05.5725719Z torch.manual_seed(2025) 2025-05-07T20:32:05.5725792Z 2025-05-07T20:32:05.5725962Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5727764Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5727770Z 2025-05-07T20:32:05.5727888Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5727902Z 2025-05-07T20:32:05.5728004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5728458Z self=, 2025-05-07T20:32:05.5728585Z T=4096, 2025-05-07T20:32:05.5728680Z D=7168, 2025-05-07T20:32:05.5728766Z scale_ub=None, 2025-05-07T20:32:05.5728864Z contiguous=True, 2025-05-07T20:32:05.5728952Z compiled=True, 2025-05-07T20:32:05.5729030Z ) 2025-05-07T20:32:05.5729256Z self = 2025-05-07T20:32:05.5729424Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.5729429Z 2025-05-07T20:32:05.5729508Z @given( 2025-05-07T20:32:05.5729633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5729731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5729852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5729969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5730084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5730176Z ) 2025-05-07T20:32:05.5730418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5730516Z def test_silu_mul_quant( 2025-05-07T20:32:05.5730602Z self, 2025-05-07T20:32:05.5730682Z T: int, 2025-05-07T20:32:05.5730760Z D: int, 2025-05-07T20:32:05.5730864Z scale_ub: Optional[float], 2025-05-07T20:32:05.5730953Z contiguous: bool, 2025-05-07T20:32:05.5731049Z compiled: bool, 2025-05-07T20:32:05.5731130Z ) -> None: 2025-05-07T20:32:05.5731226Z torch.manual_seed(2025) 2025-05-07T20:32:05.5731310Z 2025-05-07T20:32:05.5731476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5733404Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5733423Z 2025-05-07T20:32:05.5733544Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5733548Z 2025-05-07T20:32:05.5733649Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5733879Z self=, 2025-05-07T20:32:05.5733957Z T=2048, 2025-05-07T20:32:05.5734035Z D=5120, 2025-05-07T20:32:05.5734125Z scale_ub=1200.0, 2025-05-07T20:32:05.5734214Z contiguous=False, 2025-05-07T20:32:05.5734299Z compiled=False, 2025-05-07T20:32:05.5734380Z ) 2025-05-07T20:32:05.5734597Z self = 2025-05-07T20:32:05.5734899Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5734903Z 2025-05-07T20:32:05.5734982Z @given( 2025-05-07T20:32:05.5735152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5735260Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5735377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5735492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5735610Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5735684Z ) 2025-05-07T20:32:05.5735934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5736030Z def test_silu_mul_quant( 2025-05-07T20:32:05.5736107Z self, 2025-05-07T20:32:05.5736190Z T: int, 2025-05-07T20:32:05.5736268Z D: int, 2025-05-07T20:32:05.5736368Z scale_ub: Optional[float], 2025-05-07T20:32:05.5736467Z contiguous: bool, 2025-05-07T20:32:05.5736557Z compiled: bool, 2025-05-07T20:32:05.5736633Z ) -> None: 2025-05-07T20:32:05.5736731Z torch.manual_seed(2025) 2025-05-07T20:32:05.5736807Z 2025-05-07T20:32:05.5736974Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5738730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5738736Z 2025-05-07T20:32:05.5738852Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5738866Z 2025-05-07T20:32:05.5738969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5739188Z self=, 2025-05-07T20:32:05.5739269Z T=4096, 2025-05-07T20:32:05.5739352Z D=7168, 2025-05-07T20:32:05.5739440Z scale_ub=1200.0, 2025-05-07T20:32:05.5739534Z contiguous=True, 2025-05-07T20:32:05.5739619Z compiled=False, 2025-05-07T20:32:05.5739693Z ) 2025-05-07T20:32:05.5739913Z self = 2025-05-07T20:32:05.5740082Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5740087Z 2025-05-07T20:32:05.5740165Z @given( 2025-05-07T20:32:05.5740287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5740386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5740502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5740620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5740734Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5740813Z ) 2025-05-07T20:32:05.5741056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5741202Z def test_silu_mul_quant( 2025-05-07T20:32:05.5741285Z self, 2025-05-07T20:32:05.5741362Z T: int, 2025-05-07T20:32:05.5741438Z D: int, 2025-05-07T20:32:05.5741541Z scale_ub: Optional[float], 2025-05-07T20:32:05.5741632Z contiguous: bool, 2025-05-07T20:32:05.5741716Z compiled: bool, 2025-05-07T20:32:05.5741799Z ) -> None: 2025-05-07T20:32:05.5741891Z torch.manual_seed(2025) 2025-05-07T20:32:05.5741968Z 2025-05-07T20:32:05.5742132Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5743953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5744040Z 2025-05-07T20:32:05.5744156Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5744161Z 2025-05-07T20:32:05.5744262Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5744487Z self=, 2025-05-07T20:32:05.5744564Z T=16384, 2025-05-07T20:32:05.5744639Z D=7168, 2025-05-07T20:32:05.5744725Z scale_ub=None, 2025-05-07T20:32:05.5744811Z contiguous=False, 2025-05-07T20:32:05.5744893Z compiled=True, 2025-05-07T20:32:05.5744973Z ) 2025-05-07T20:32:05.5745186Z self = 2025-05-07T20:32:05.5745371Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:05.5745376Z 2025-05-07T20:32:05.5745452Z @given( 2025-05-07T20:32:05.5745571Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5745671Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5745782Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5745896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5746015Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5746089Z ) 2025-05-07T20:32:05.5746330Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5746430Z def test_silu_mul_quant( 2025-05-07T20:32:05.5746507Z self, 2025-05-07T20:32:05.5746591Z T: int, 2025-05-07T20:32:05.5746670Z D: int, 2025-05-07T20:32:05.5746772Z scale_ub: Optional[float], 2025-05-07T20:32:05.5746869Z contiguous: bool, 2025-05-07T20:32:05.5746956Z compiled: bool, 2025-05-07T20:32:05.5747032Z ) -> None: 2025-05-07T20:32:05.5747131Z torch.manual_seed(2025) 2025-05-07T20:32:05.5747205Z 2025-05-07T20:32:05.5747373Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5749233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5749242Z 2025-05-07T20:32:05.5749360Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5749364Z 2025-05-07T20:32:05.5749470Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5749692Z self=, 2025-05-07T20:32:05.5749824Z T=4096, 2025-05-07T20:32:05.5749901Z D=7168, 2025-05-07T20:32:05.5749982Z scale_ub=None, 2025-05-07T20:32:05.5750071Z contiguous=True, 2025-05-07T20:32:05.5750154Z compiled=False, 2025-05-07T20:32:05.5750225Z ) 2025-05-07T20:32:05.5750447Z self = 2025-05-07T20:32:05.5750613Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.5750617Z 2025-05-07T20:32:05.5750692Z @given( 2025-05-07T20:32:05.5750811Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5750908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5751069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5751223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5751334Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5751412Z ) 2025-05-07T20:32:05.5751692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5751787Z def test_silu_mul_quant( 2025-05-07T20:32:05.5751869Z self, 2025-05-07T20:32:05.5751945Z T: int, 2025-05-07T20:32:05.5752020Z D: int, 2025-05-07T20:32:05.5752124Z scale_ub: Optional[float], 2025-05-07T20:32:05.5752212Z contiguous: bool, 2025-05-07T20:32:05.5752297Z compiled: bool, 2025-05-07T20:32:05.5752384Z ) -> None: 2025-05-07T20:32:05.5752481Z torch.manual_seed(2025) 2025-05-07T20:32:05.5752571Z 2025-05-07T20:32:05.5752761Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5754530Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5754548Z 2025-05-07T20:32:05.5754668Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5754672Z 2025-05-07T20:32:05.5754774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5755000Z self=, 2025-05-07T20:32:05.5755078Z T=16384, 2025-05-07T20:32:05.5755158Z D=7168, 2025-05-07T20:32:05.5755245Z scale_ub=None, 2025-05-07T20:32:05.5755329Z contiguous=True, 2025-05-07T20:32:05.5755416Z compiled=False, 2025-05-07T20:32:05.5755498Z ) 2025-05-07T20:32:05.5755711Z self = 2025-05-07T20:32:05.5755893Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:05.5755897Z 2025-05-07T20:32:05.5755977Z @given( 2025-05-07T20:32:05.5756093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5756195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5756308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5756424Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5756545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5756619Z ) 2025-05-07T20:32:05.5756863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5756966Z def test_silu_mul_quant( 2025-05-07T20:32:05.5757042Z self, 2025-05-07T20:32:05.5757128Z T: int, 2025-05-07T20:32:05.5757205Z D: int, 2025-05-07T20:32:05.5757302Z scale_ub: Optional[float], 2025-05-07T20:32:05.5757402Z contiguous: bool, 2025-05-07T20:32:05.5757486Z compiled: bool, 2025-05-07T20:32:05.5757564Z ) -> None: 2025-05-07T20:32:05.5757713Z torch.manual_seed(2025) 2025-05-07T20:32:05.5757788Z 2025-05-07T20:32:05.5757952Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5759710Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5759789Z 2025-05-07T20:32:05.5759905Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5759910Z 2025-05-07T20:32:05.5760020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5760278Z self=, 2025-05-07T20:32:05.5760361Z T=16384, 2025-05-07T20:32:05.5760440Z D=7168, 2025-05-07T20:32:05.5760522Z scale_ub=1200.0, 2025-05-07T20:32:05.5760613Z contiguous=True, 2025-05-07T20:32:05.5760697Z compiled=False, 2025-05-07T20:32:05.5760770Z ) 2025-05-07T20:32:05.5760987Z self = 2025-05-07T20:32:05.5761161Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5761165Z 2025-05-07T20:32:05.5761242Z @given( 2025-05-07T20:32:05.5761362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5761461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5761589Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5761703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5761822Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5761896Z ) 2025-05-07T20:32:05.5762143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5762244Z def test_silu_mul_quant( 2025-05-07T20:32:05.5762319Z self, 2025-05-07T20:32:05.5762395Z T: int, 2025-05-07T20:32:05.5762477Z D: int, 2025-05-07T20:32:05.5762579Z scale_ub: Optional[float], 2025-05-07T20:32:05.5762674Z contiguous: bool, 2025-05-07T20:32:05.5762785Z compiled: bool, 2025-05-07T20:32:05.5762870Z ) -> None: 2025-05-07T20:32:05.5762980Z torch.manual_seed(2025) 2025-05-07T20:32:05.5763062Z 2025-05-07T20:32:05.5763228Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5764999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5765006Z 2025-05-07T20:32:05.5765122Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5765126Z 2025-05-07T20:32:05.5765236Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5765456Z self=, 2025-05-07T20:32:05.5765534Z T=128, 2025-05-07T20:32:05.5765618Z D=5120, 2025-05-07T20:32:05.5765707Z scale_ub=1200.0, 2025-05-07T20:32:05.5765796Z contiguous=False, 2025-05-07T20:32:05.5765892Z compiled=False, 2025-05-07T20:32:05.5765968Z ) 2025-05-07T20:32:05.5766181Z self = 2025-05-07T20:32:05.5766413Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:05.5766418Z 2025-05-07T20:32:05.5766497Z @given( 2025-05-07T20:32:05.5766615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5766712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5766824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5766945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5767056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5767130Z ) 2025-05-07T20:32:05.5767381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5767514Z def test_silu_mul_quant( 2025-05-07T20:32:05.5767634Z self, 2025-05-07T20:32:05.5767711Z T: int, 2025-05-07T20:32:05.5767789Z D: int, 2025-05-07T20:32:05.5767891Z scale_ub: Optional[float], 2025-05-07T20:32:05.5767979Z contiguous: bool, 2025-05-07T20:32:05.5768104Z compiled: bool, 2025-05-07T20:32:05.5768194Z ) -> None: 2025-05-07T20:32:05.5768288Z torch.manual_seed(2025) 2025-05-07T20:32:05.5768360Z 2025-05-07T20:32:05.5768530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5768603Z 2025-05-07T20:32:05.5768696Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5768824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5768914Z x = x_sign * x_clamp 2025-05-07T20:32:05.5768993Z x0 = x[:, :D] 2025-05-07T20:32:05.5769079Z x1 = x[:, D:] 2025-05-07T20:32:05.5769152Z 2025-05-07T20:32:05.5769240Z if contiguous: 2025-05-07T20:32:05.5769335Z x0 = x0.contiguous() 2025-05-07T20:32:05.5769427Z x1 = x1.contiguous() 2025-05-07T20:32:05.5769504Z 2025-05-07T20:32:05.5769594Z if scale_ub is not None: 2025-05-07T20:32:05.5769701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5769845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5769922Z ) 2025-05-07T20:32:05.5769997Z else: 2025-05-07T20:32:05.5770101Z scale_ub_tensor = None 2025-05-07T20:32:05.5770173Z 2025-05-07T20:32:05.5770304Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5770399Z op = silu_mul_quant 2025-05-07T20:32:05.5770485Z if compiled: 2025-05-07T20:32:05.5770591Z op = torch.compile(op) 2025-05-07T20:32:05.5770696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5770767Z 2025-05-07T20:32:05.5770861Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5770866Z 2025-05-07T20:32:05.5770967Z moe/activation_test.py:117: 2025-05-07T20:32:05.5771099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5771204Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5771303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5771806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5771908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5772264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5772487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5772825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5772919Z kernel = self.compile( 2025-05-07T20:32:05.5773305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5773480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5773615Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5773619Z 2025-05-07T20:32:05.5773892Z self = 2025-05-07T20:32:05.5774670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5775173Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f489817f6a0>} 2025-05-07T20:32:05.5775915Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5776185Z context = 2025-05-07T20:32:05.5776189Z 2025-05-07T20:32:05.5776391Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5776651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5776766Z module_map=module_map) 2025-05-07T20:32:05.5776926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5777030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5777111Z E ^ 2025-05-07T20:32:05.5777464Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5777469Z 2025-05-07T20:32:05.5777884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5777897Z 2025-05-07T20:32:05.5778004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5778227Z self=, 2025-05-07T20:32:05.5778305Z T=2048, 2025-05-07T20:32:05.5778385Z D=7168, 2025-05-07T20:32:05.5778477Z scale_ub=None, 2025-05-07T20:32:05.5778564Z contiguous=False, 2025-05-07T20:32:05.5778648Z compiled=False, 2025-05-07T20:32:05.5778724Z ) 2025-05-07T20:32:05.5778940Z self = 2025-05-07T20:32:05.5779111Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:05.5779115Z 2025-05-07T20:32:05.5779200Z @given( 2025-05-07T20:32:05.5779316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5779421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5779534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5779652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5779773Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5779847Z ) 2025-05-07T20:32:05.5780096Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5780201Z def test_silu_mul_quant( 2025-05-07T20:32:05.5780278Z self, 2025-05-07T20:32:05.5780355Z T: int, 2025-05-07T20:32:05.5780439Z D: int, 2025-05-07T20:32:05.5780537Z scale_ub: Optional[float], 2025-05-07T20:32:05.5780626Z contiguous: bool, 2025-05-07T20:32:05.5780719Z compiled: bool, 2025-05-07T20:32:05.5780797Z ) -> None: 2025-05-07T20:32:05.5780898Z torch.manual_seed(2025) 2025-05-07T20:32:05.5780970Z 2025-05-07T20:32:05.5781135Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5783008Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5783019Z 2025-05-07T20:32:05.5783138Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5783142Z 2025-05-07T20:32:05.5783251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5783470Z self=, 2025-05-07T20:32:05.5783547Z T=128, 2025-05-07T20:32:05.5783631Z D=7168, 2025-05-07T20:32:05.5783714Z scale_ub=1200.0, 2025-05-07T20:32:05.5783800Z contiguous=True, 2025-05-07T20:32:05.5783891Z compiled=True, 2025-05-07T20:32:05.5784006Z ) 2025-05-07T20:32:05.5784267Z self = 2025-05-07T20:32:05.5784431Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5784435Z 2025-05-07T20:32:05.5784548Z @given( 2025-05-07T20:32:05.5784672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5784770Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5784884Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5785005Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5785115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5785188Z ) 2025-05-07T20:32:05.5785439Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5785530Z def test_silu_mul_quant( 2025-05-07T20:32:05.5785613Z self, 2025-05-07T20:32:05.5785691Z T: int, 2025-05-07T20:32:05.5785770Z D: int, 2025-05-07T20:32:05.5785877Z scale_ub: Optional[float], 2025-05-07T20:32:05.5785967Z contiguous: bool, 2025-05-07T20:32:05.5786052Z compiled: bool, 2025-05-07T20:32:05.5786135Z ) -> None: 2025-05-07T20:32:05.5786231Z torch.manual_seed(2025) 2025-05-07T20:32:05.5786309Z 2025-05-07T20:32:05.5786479Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5786553Z 2025-05-07T20:32:05.5786646Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5786775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5786864Z x = x_sign * x_clamp 2025-05-07T20:32:05.5786953Z x0 = x[:, :D] 2025-05-07T20:32:05.5787032Z x1 = x[:, D:] 2025-05-07T20:32:05.5787104Z 2025-05-07T20:32:05.5787192Z if contiguous: 2025-05-07T20:32:05.5787283Z x0 = x0.contiguous() 2025-05-07T20:32:05.5787371Z x1 = x1.contiguous() 2025-05-07T20:32:05.5787455Z 2025-05-07T20:32:05.5787544Z if scale_ub is not None: 2025-05-07T20:32:05.5787653Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.5787793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.5787869Z ) 2025-05-07T20:32:05.5787953Z else: 2025-05-07T20:32:05.5788054Z scale_ub_tensor = None 2025-05-07T20:32:05.5788125Z 2025-05-07T20:32:05.5788253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.5788348Z op = silu_mul_quant 2025-05-07T20:32:05.5788432Z if compiled: 2025-05-07T20:32:05.5788538Z op = torch.compile(op) 2025-05-07T20:32:05.5788645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5788718Z 2025-05-07T20:32:05.5788813Z > y_fp8, y_scale = fn() 2025-05-07T20:32:05.5788817Z 2025-05-07T20:32:05.5788914Z moe/activation_test.py:117: 2025-05-07T20:32:05.5789041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5789200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:05.5789300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.5789675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:05.5789816Z return fn(*args, **kwargs) 
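NOTE: Everything from the first OutOfMemoryError down to here is the same failure repeating: the process already holds 22.04 GiB of the GPU's 22.07 GiB, so even the 20-448 MiB torch.randn/torch.clamp allocations at the top of each hypothesis example abort before any kernel runs. A minimal cleanup sketch, assuming the pressure comes from cached allocations surviving across examples (the helper name is hypothetical and would be called at the top of the test body -- hypothesis runs all examples inside a single pytest call, so a per-test fixture would fire too late):

    import gc
    import torch

    def _reclaim_cuda_memory() -> None:
        # Drop dead tensors left over from a previous failed example, then
        # return cached-but-unused allocator blocks to the CUDA driver.
        # Live tensors are unaffected; this only trims the cache.
        gc.collect()
        torch.cuda.empty_cache()

The error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation instead, and would have to be set in the job environment before the Python process starts.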
2025-05-07T20:32:05.5790306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:05.5790413Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:05.5790767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.5790987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.5791329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.5791425Z kernel = self.compile( 2025-05-07T20:32:05.5791852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.5792065Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.5792231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.5792236Z 2025-05-07T20:32:05.5792443Z self = 2025-05-07T20:32:05.5793267Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.5793768Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f48983c68e0>} 2025-05-07T20:32:05.5794510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.5794705Z context = 2025-05-07T20:32:05.5794716Z 2025-05-07T20:32:05.5794882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.5795141Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.5795255Z module_map=module_map) 2025-05-07T20:32:05.5795415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.5795517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.5795603Z E ^ 2025-05-07T20:32:05.5795955Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.5795960Z 2025-05-07T20:32:05.5796377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.5796385Z 2025-05-07T20:32:05.5796491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5796713Z self=, 2025-05-07T20:32:05.5796799Z T=128, 2025-05-07T20:32:05.5796876Z D=7168, 2025-05-07T20:32:05.5796958Z scale_ub=1200.0, 2025-05-07T20:32:05.5797048Z contiguous=True, 2025-05-07T20:32:05.5797134Z compiled=False, 2025-05-07T20:32:05.5797206Z ) 2025-05-07T20:32:05.5797423Z self = 2025-05-07T20:32:05.5797594Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:05.5797599Z 2025-05-07T20:32:05.5797682Z @given( 2025-05-07T20:32:05.5797800Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5797900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5798021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5798138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5798250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5798327Z ) 2025-05-07T20:32:05.5798613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5798707Z def test_silu_mul_quant( 2025-05-07T20:32:05.5798790Z self, 2025-05-07T20:32:05.5798866Z T: int, 2025-05-07T20:32:05.5798950Z D: int, 2025-05-07T20:32:05.5799047Z scale_ub: Optional[float], 2025-05-07T20:32:05.5799135Z contiguous: bool, 2025-05-07T20:32:05.5799225Z compiled: bool, 2025-05-07T20:32:05.5799302Z ) -> None: 2025-05-07T20:32:05.5799395Z torch.manual_seed(2025) 2025-05-07T20:32:05.5799473Z 2025-05-07T20:32:05.5799638Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5799752Z 2025-05-07T20:32:05.5799852Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5800012Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5801836Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5801843Z 2025-05-07T20:32:05.5801960Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.5801964Z 2025-05-07T20:32:05.5802070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5802293Z self=, 2025-05-07T20:32:05.5802380Z T=128, 2025-05-07T20:32:05.5802463Z D=5120, 2025-05-07T20:32:05.5802548Z scale_ub=1200.0, 2025-05-07T20:32:05.5802633Z contiguous=True, 2025-05-07T20:32:05.5802721Z compiled=True, 2025-05-07T20:32:05.5802797Z ) 2025-05-07T20:32:05.5803018Z self = 2025-05-07T20:32:05.5803184Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:05.5803189Z 2025-05-07T20:32:05.5803269Z @given( 2025-05-07T20:32:05.5803392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5803489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5803601Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5803723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5803834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5803907Z ) 2025-05-07T20:32:05.5804157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5804252Z def test_silu_mul_quant( 2025-05-07T20:32:05.5804330Z self, 2025-05-07T20:32:05.5804412Z T: int, 2025-05-07T20:32:05.5804486Z D: int, 2025-05-07T20:32:05.5804588Z scale_ub: Optional[float], 2025-05-07T20:32:05.5804683Z contiguous: bool, 2025-05-07T20:32:05.5804769Z compiled: bool, 2025-05-07T20:32:05.5804850Z ) -> None: 2025-05-07T20:32:05.5808832Z torch.manual_seed(2025) 2025-05-07T20:32:05.5808917Z 2025-05-07T20:32:05.5809094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5809167Z 2025-05-07T20:32:05.5809267Z x_sign = torch.sign(x) 2025-05-07T20:32:05.5809390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.5811233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5811245Z 2025-05-07T20:32:05.5811364Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:05.5811369Z 2025-05-07T20:32:05.5811473Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.5811692Z self=, 2025-05-07T20:32:05.5811767Z T=128, 2025-05-07T20:32:05.5811843Z D=7168, 2025-05-07T20:32:05.5811922Z scale_ub=None, 2025-05-07T20:32:05.5812004Z contiguous=True, 2025-05-07T20:32:05.5812087Z compiled=True, 2025-05-07T20:32:05.5812158Z ) 2025-05-07T20:32:05.5812418Z self = 2025-05-07T20:32:05.5812625Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.5812629Z 2025-05-07T20:32:05.5812708Z @given( 2025-05-07T20:32:05.5812882Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.5812981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.5813093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.5813210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.5813319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.5813391Z ) 2025-05-07T20:32:05.5813636Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.5813727Z def test_silu_mul_quant( 2025-05-07T20:32:05.5813802Z self, 2025-05-07T20:32:05.5813879Z T: int, 2025-05-07T20:32:05.5813954Z D: int, 2025-05-07T20:32:05.5814053Z scale_ub: Optional[float], 2025-05-07T20:32:05.5814147Z contiguous: bool, 2025-05-07T20:32:05.5814230Z compiled: bool, 2025-05-07T20:32:05.5814318Z ) -> None: 2025-05-07T20:32:05.5814413Z torch.manual_seed(2025) 2025-05-07T20:32:05.5814482Z 2025-05-07T20:32:05.5814654Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.5816431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:05.5816439Z 2025-05-07T20:32:05.5816558Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:05.5816693Z =============================== warnings summary =============================== 2025-05-07T20:32:05.5817002Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5817304Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5817596Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:05.5818471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:05.5818698Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:05.5818705Z 2025-05-07T20:32:05.5818911Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:05.5819081Z ================= 1 failed, 1 deselected, 3 warnings in 16.14s ================= 2025-05-07T20:32:07.2168638Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:07.2785424Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:07.2786037Z 2025-05-07T20:32:09.2802878Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:11.4264961Z ============================= test session starts ============================== 2025-05-07T20:32:11.4265689Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.4266609Z cachedir: .pytest_cache 2025-05-07T20:32:11.4267187Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.4267981Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.4268391Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.0204626Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:13.1707283Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:13.1707702Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:13.1707924Z 2025-05-07T20:32:15.5330779Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5332614Z self=, 2025-05-07T20:32:15.5333531Z T=1, 2025-05-07T20:32:15.5333934Z D=5120, 2025-05-07T20:32:15.5334359Z scale_ub=None, 2025-05-07T20:32:15.5334815Z contiguous=True, 2025-05-07T20:32:15.5335289Z compiled=True, 2025-05-07T20:32:15.5335711Z ) 2025-05-07T20:32:15.5336264Z self = 2025-05-07T20:32:15.5336818Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:15.5337079Z 2025-05-07T20:32:15.5337178Z @given( 2025-05-07T20:32:15.5337419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5337749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5338066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5338399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5338736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5339038Z ) 2025-05-07T20:32:15.5339393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5339846Z def test_silu_mul_quant( 2025-05-07T20:32:15.5340108Z self, 2025-05-07T20:32:15.5340311Z T: int, 2025-05-07T20:32:15.5340527Z D: int, 2025-05-07T20:32:15.5340760Z scale_ub: Optional[float], 2025-05-07T20:32:15.5341034Z contiguous: bool, 2025-05-07T20:32:15.5341293Z compiled: bool, 2025-05-07T20:32:15.5341534Z ) -> None: 2025-05-07T20:32:15.5341767Z torch.manual_seed(2025) 2025-05-07T20:32:15.5342014Z 2025-05-07T20:32:15.5342302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5342656Z 2025-05-07T20:32:15.5342868Z x_sign = torch.sign(x) 2025-05-07T20:32:15.5343169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:15.5343480Z x = x_sign * x_clamp 2025-05-07T20:32:15.5343735Z x0 = x[:, :D] 2025-05-07T20:32:15.5343963Z x1 = x[:, D:] 2025-05-07T20:32:15.5344173Z 2025-05-07T20:32:15.5344369Z if contiguous: 2025-05-07T20:32:15.5344619Z x0 = x0.contiguous() 2025-05-07T20:32:15.5344882Z x1 = x1.contiguous() 2025-05-07T20:32:15.5345134Z 2025-05-07T20:32:15.5345337Z if scale_ub is not None: 2025-05-07T20:32:15.5345610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.5346261Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.5346583Z ) 2025-05-07T20:32:15.5346779Z else: 2025-05-07T20:32:15.5347004Z scale_ub_tensor = None 2025-05-07T20:32:15.5347261Z 2025-05-07T20:32:15.5347495Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5347817Z op = silu_mul_quant 2025-05-07T20:32:15.5348076Z if compiled: 2025-05-07T20:32:15.5348329Z op = torch.compile(op) 2025-05-07T20:32:15.5348625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.5348907Z 2025-05-07T20:32:15.5349226Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.5349514Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.5349986Z 2025-05-07T20:32:15.5350236Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.5350574Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.5350952Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.5351279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.5351636Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.5351957Z 2025-05-07T20:32:15.5352168Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:15.5352364Z 2025-05-07T20:32:15.5352478Z moe/activation_test.py:126: 2025-05-07T20:32:15.5352780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5353128Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.5353458Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.5354244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.5355013Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.5355564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.5356280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.5356993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.5357720Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.5358478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:15.5359227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.5359953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.5360605Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.5361218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.5361736Z fn() 2025-05-07T20:32:15.5362253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.5362842Z self.fn.run( 
2025-05-07T20:32:15.5363318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.5363849Z kernel = self.compile( 2025-05-07T20:32:15.5364398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.5365059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.5365457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.5365698Z 2025-05-07T20:32:15.5365907Z self = 2025-05-07T20:32:15.5367050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.5368439Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd95239260>} 2025-05-07T20:32:15.5369785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.5370803Z context = 2025-05-07T20:32:15.5371137Z 2025-05-07T20:32:15.5371348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.5371867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.5372382Z module_map=module_map) 2025-05-07T20:32:15.5372748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.5373109Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.5373382Z E ^ 2025-05-07T20:32:15.5373866Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.5374312Z 2025-05-07T20:32:15.5374726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.5375241Z 2025-05-07T20:32:15.5375352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.5375779Z self=, 2025-05-07T20:32:15.5376201Z T=2048, 2025-05-07T20:32:15.5376398Z D=5120, 2025-05-07T20:32:15.5376636Z scale_ub=1200.0, 2025-05-07T20:32:15.5376883Z contiguous=True, 2025-05-07T20:32:15.5377109Z compiled=False, 2025-05-07T20:32:15.5377326Z ) 2025-05-07T20:32:16.4623123Z self = 2025-05-07T20:32:16.4623903Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.4624286Z 2025-05-07T20:32:16.4624409Z @given( 2025-05-07T20:32:16.4624721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4625101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4625415Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4625751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4626074Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4626365Z ) 2025-05-07T20:32:16.4626721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4627165Z def test_silu_mul_quant( 2025-05-07T20:32:16.4627413Z self, 2025-05-07T20:32:16.4627612Z T: int, 2025-05-07T20:32:16.4627811Z D: int, 2025-05-07T20:32:16.4628047Z scale_ub: Optional[float], 2025-05-07T20:32:16.4628594Z contiguous: bool, 2025-05-07T20:32:16.4628837Z compiled: bool, 2025-05-07T20:32:16.4629118Z ) -> None: 2025-05-07T20:32:16.4629341Z torch.manual_seed(2025) 2025-05-07T20:32:16.4629582Z 2025-05-07T20:32:16.4629861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4630207Z 
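NOTE: This traceback shows that the reference path is just as Triton-bound as the op under test: triton_quantize_fp8_row JIT-compiles _kernel_quantize_fp8_row for fp8e4nv and dies with the identical ValueError, so neither fn() nor ref_fn() can produce a result on this GPU. A reference that stays in plain PyTorch would avoid Triton entirely. The sketch below assumes the kernel's row-wise contract is y ~= y_fp8.float() * scale[:, None], which matches how the test consumes y_scale; the real kernel's handling of scale_ub and all-zero rows may differ:

    import torch

    FP8_DTYPE = torch.float8_e4m3fn        # what Triton calls fp8e4nv
    FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical pure-PyTorch stand-in for triton_quantize_fp8_row:
        # one scale per row, chosen so the row's max |value| maps to FP8_MAX.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(FP8_DTYPE)
        return y_fp8, scale

Given that the test clamps |x| to [0.01, 2.0], |y| = |x0| * sigmoid(x0) * |x1| <= 2 * 1 * 2 = 4, so every row scale is at most 4/448 and the values sit comfortably inside fp8e4nv's range.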
2025-05-07T20:32:16.4630402Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4630703Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4631016Z x = x_sign * x_clamp 2025-05-07T20:32:16.4631264Z x0 = x[:, :D] 2025-05-07T20:32:16.4631481Z x1 = x[:, D:] 2025-05-07T20:32:16.4631697Z 2025-05-07T20:32:16.4631893Z if contiguous: 2025-05-07T20:32:16.4632124Z x0 = x0.contiguous() 2025-05-07T20:32:16.4632385Z x1 = x1.contiguous() 2025-05-07T20:32:16.4632630Z 2025-05-07T20:32:16.4632822Z if scale_ub is not None: 2025-05-07T20:32:16.4633377Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4633716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4634022Z ) 2025-05-07T20:32:16.4634223Z else: 2025-05-07T20:32:16.4634439Z scale_ub_tensor = None 2025-05-07T20:32:16.4634684Z 2025-05-07T20:32:16.4634917Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4635232Z op = silu_mul_quant 2025-05-07T20:32:16.4635481Z if compiled: 2025-05-07T20:32:16.4635738Z op = torch.compile(op) 2025-05-07T20:32:16.4636036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4636407Z 2025-05-07T20:32:16.4636599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4636880Z 2025-05-07T20:32:16.4636984Z moe/activation_test.py:117: 2025-05-07T20:32:16.4637284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4637694Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4637993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4638685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.4639370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4639906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4640586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4641249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4641778Z kernel = self.compile( 2025-05-07T20:32:16.4642325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4642982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4643386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4643614Z 2025-05-07T20:32:16.4643824Z self = 2025-05-07T20:32:16.4644914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4646319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd94ee4180>} 2025-05-07T20:32:16.4647679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4648717Z context = 2025-05-07T20:32:16.4649010Z 2025-05-07T20:32:16.4649179Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4649703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4650179Z module_map=module_map) 2025-05-07T20:32:16.4650541Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4650897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4651164Z E ^ 2025-05-07T20:32:16.4651637Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4652088Z 2025-05-07T20:32:16.4652508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4653027Z 2025-05-07T20:32:16.4653135Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4653604Z self=, 2025-05-07T20:32:16.4654011Z T=2048, 2025-05-07T20:32:16.4654200Z D=5120, 2025-05-07T20:32:16.4654401Z scale_ub=1200.0, 2025-05-07T20:32:16.4654628Z contiguous=True, 2025-05-07T20:32:16.4654848Z compiled=True, 2025-05-07T20:32:16.4655063Z ) 2025-05-07T20:32:16.4655388Z self = 2025-05-07T20:32:16.4655878Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.4656154Z 2025-05-07T20:32:16.4656234Z @given( 2025-05-07T20:32:16.4656470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4656825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4657207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4657554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4657886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658208Z ) 2025-05-07T20:32:16.4658564Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4659009Z def test_silu_mul_quant( 2025-05-07T20:32:16.4659247Z self, 2025-05-07T20:32:16.4659447Z T: int, 2025-05-07T20:32:16.4659647Z D: int, 2025-05-07T20:32:16.4659863Z scale_ub: Optional[float], 2025-05-07T20:32:16.4660141Z contiguous: bool, 2025-05-07T20:32:16.4660382Z compiled: bool, 2025-05-07T20:32:16.4660605Z ) -> None: 2025-05-07T20:32:16.4660823Z torch.manual_seed(2025) 2025-05-07T20:32:16.4661072Z 2025-05-07T20:32:16.4661343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4661694Z 2025-05-07T20:32:16.4661896Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4662187Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4662499Z x = x_sign * x_clamp 2025-05-07T20:32:16.4662753Z x0 = x[:, :D] 2025-05-07T20:32:16.4662979Z x1 = x[:, D:] 2025-05-07T20:32:16.4663189Z 2025-05-07T20:32:16.4663378Z if contiguous: 2025-05-07T20:32:16.4663621Z x0 = x0.contiguous() 2025-05-07T20:32:16.4663884Z x1 = x1.contiguous() 2025-05-07T20:32:16.4664128Z 2025-05-07T20:32:16.4664325Z if scale_ub is not None: 2025-05-07T20:32:16.4664596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4664935Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4665244Z ) 2025-05-07T20:32:16.4665437Z else: 2025-05-07T20:32:16.4665651Z scale_ub_tensor = None 2025-05-07T20:32:16.4665906Z 2025-05-07T20:32:16.4666137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4666460Z op = silu_mul_quant 2025-05-07T20:32:16.4666716Z if compiled: 
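The error above is architecture-dependent, not input-dependent: Triton only lowers the fp8e4nv (torch.float8_e4m3fn) type on GPUs with compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G, which reports compute capability 8.6 and therefore supports only the fp8e4b15 and fp8e5 encodings named in the ValueError. A minimal guard sketch in Python (the helper and marker below are illustrative, not part of moe/activation_test.py):

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers torch.float8_e4m3fn to fp8e4nv, which needs SM 8.9+
        # (e.g. L4, H100); the A10G on this runner is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker that would skip fp8 tests on this runner:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv needs compute capability >= 8.9",
    )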
Hypothesis goes on to try ten more examples, and every one fails with the same CompilationError: examples with compiled=False die at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True get past fn() and die at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while the Triton autotuner benchmarks _kernel_quantize_fp8_row, each time with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The examples tried, with the duplicated source listings and tracebacks collapsed:

2025-05-07T20:32:16.4653135Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:16.4693654Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:17.2630636Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:17.2671669Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.1841871Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.1873360Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:18.2353981Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:18.5352539Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:18.5383494Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:18.9872797Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)

The traceback of the final example is still in progress when the captured log ends.
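For orientation, the computation both failing kernels implement is small. A rough eager-PyTorch equivalent of the silu_mul_quant path under test (a sketch only: it assumes torch.float8_e4m3fn is available and approximates FBGEMM's row-wise quantization, which may differ in details such as zero-row and scale_ub handling):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then per-row scaling into fp8 e4m3.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # The test reconstructs y as y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale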
2025-05-07T20:32:19.4216896Z op = torch.compile(op) 2025-05-07T20:32:19.4217197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.4217479Z 2025-05-07T20:32:19.4217676Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.4217971Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.4218269Z 2025-05-07T20:32:19.4218505Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.4218847Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.4219146Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.4219468Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.4219830Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.4220151Z 2025-05-07T20:32:19.4220361Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.4220557Z 2025-05-07T20:32:19.4220660Z moe/activation_test.py:126: 2025-05-07T20:32:19.4221103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.4221447Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.4221774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.4222572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.4223346Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.4223902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.4224590Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.4225335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.4226152Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.4226967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:19.4227772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.4228838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.4229551Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.4230158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.4230691Z fn() 2025-05-07T20:32:19.4231215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.4231818Z self.fn.run( 2025-05-07T20:32:19.4232290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.4232836Z kernel = self.compile( 2025-05-07T20:32:19.4233391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.4234054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.4234465Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.4234705Z 2025-05-07T20:32:19.4234916Z self = 2025-05-07T20:32:19.4236013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', 
sanitize_overflow=True) 2025-05-07T20:32:19.4237430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f6c2d40>} 2025-05-07T20:32:19.4238840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.4239885Z context = 2025-05-07T20:32:19.4240186Z 2025-05-07T20:32:19.4240356Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.4240884Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.4241355Z module_map=module_map) 2025-05-07T20:32:19.4241733Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.4242099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.4242368Z E ^ 2025-05-07T20:32:19.4242847Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.4243311Z 2025-05-07T20:32:19.4243827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.4244347Z 2025-05-07T20:32:19.4244463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.4244876Z self=, 2025-05-07T20:32:19.4245288Z T=128, 2025-05-07T20:32:19.4245487Z D=5120, 2025-05-07T20:32:19.4245685Z scale_ub=None, 2025-05-07T20:32:19.4245913Z contiguous=True, 2025-05-07T20:32:19.4246147Z compiled=True, 2025-05-07T20:32:19.4246361Z ) 2025-05-07T20:32:20.0838283Z self = 2025-05-07T20:32:20.0839336Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:20.0839823Z 2025-05-07T20:32:20.0839928Z @given( 2025-05-07T20:32:20.0840235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.0840730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.0841046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.0841386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.0841723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.0842005Z ) 2025-05-07T20:32:20.0842358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.0842802Z def test_silu_mul_quant( 2025-05-07T20:32:20.0843057Z self, 2025-05-07T20:32:20.0843258Z T: int, 2025-05-07T20:32:20.0843465Z D: int, 2025-05-07T20:32:20.0843696Z scale_ub: Optional[float], 2025-05-07T20:32:20.0843966Z contiguous: bool, 2025-05-07T20:32:20.0844225Z compiled: bool, 2025-05-07T20:32:20.0844465Z ) -> None: 2025-05-07T20:32:20.0844682Z torch.manual_seed(2025) 2025-05-07T20:32:20.0844932Z 2025-05-07T20:32:20.0845216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.0845562Z 2025-05-07T20:32:20.0845769Z x_sign = torch.sign(x) 2025-05-07T20:32:20.0846073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.0846384Z x = x_sign * x_clamp 2025-05-07T20:32:20.0846633Z x0 = x[:, :D] 2025-05-07T20:32:20.0846880Z x1 = x[:, D:] 2025-05-07T20:32:20.0847096Z 2025-05-07T20:32:20.0847284Z if contiguous: 2025-05-07T20:32:20.0847538Z x0 = x0.contiguous() 2025-05-07T20:32:20.0847843Z x1 = x1.contiguous() 2025-05-07T20:32:20.0848089Z 2025-05-07T20:32:20.0848295Z if scale_ub is not None: 2025-05-07T20:32:20.0848580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.0848915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.0849237Z ) 2025-05-07T20:32:20.0849438Z else: 2025-05-07T20:32:20.0849654Z scale_ub_tensor = None 2025-05-07T20:32:20.0849910Z 2025-05-07T20:32:20.0850147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
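The ref_fn shown above is just SiLU gating (x0 * sigmoid(x0) * x1) followed by row-wise fp8 quantization, so the failing reference path can be sanity-checked without Triton at all. A minimal pure-PyTorch sketch, assuming a float8_e4m3fn target and a row_max / fp8_max scale convention (the helper name rowwise_quantize_fp8_ref is hypothetical; fbgemm_gpu's triton_quantize_fp8_row may clamp or guard differently):

    from typing import Optional, Tuple

    import torch

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Scale each row so its max |value| maps onto the fp8 range,
        # optionally clamping the row max to scale_ub first (mirroring
        # the test's scale_ub_tensor).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Mirrors ref_fn: y = x0 * sigmoid(x0) * x1, then quantize per row.
    x0 = torch.randn(4, 16)
    x1 = torch.randn(4, 16)
    y = x0 * torch.sigmoid(x0) * x1
    y_fp8, y_scale = rowwise_quantize_fp8_ref(y)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does, recovers y up to fp8 rounding.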
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[... same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:20.627000 238910 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
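The recompile_limit warning above is expected with this test: each new (T, stride) combination Hypothesis tries guard-misses on the compiled silu_mul_quant until torch._dynamo gives up after 8 recompiles. Two common mitigations, sketched under the assumption that the config knob named in the warning is available in this PyTorch build (the import path is taken from the traceback above; the limit value is illustrative):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the limit named in the warning (default 8) so shape-churning
    # property tests can keep compiling fresh graphs...
    torch._dynamo.config.recompile_limit = 64

    # ...or compile once with dynamic shapes so T and the input strides
    # stay symbolic instead of guard-missing into a new graph per example.
    op = torch.compile(silu_mul_quant, dynamic=True)

Running with TORCH_LOGS="recompiles", as the warning suggests, would print every guard miss; here the x0 stride flips between 5120 (contiguous copy) and 10240 (view into the [T, 2*D] buffer) as the contiguous parameter varies.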
[... T=16384 example: same source listing; fails at moe/activation_test.py:126 in ref_fn with the same CompilationError in _kernel_quantize_fp8_row ...]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

[... same source listing; with scale_ub set, the failure now occurs in fn() itself, i.e. in the fused forward kernel rather than the reference quantization ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
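Every failure in this run bottoms out in the same ValueError: Triton's fp8e4nv type is float8_e4m3fn, whose codegen requires compute capability 8.9 or newer (Ada/Hopper), while the A10G on this linux.g5.4xlarge runner is SM 8.6, where Triton only offers fp8e4b15 and fp8e5. A minimal guard sketch for tests like the one above (the helper and skip messages are illustrative, not FBGEMM API):

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # fp8e4nv / float8_e4m3fn Triton codegen needs SM >= 8.9
        # (e.g. L4, L40S, H100); the A10G in this job reports (8, 6).
        if not torch.cuda.is_available():
            pytest.skip("CUDA device required")
        if torch.cuda.get_device_capability() < (8, 9):
            pytest.skip("fp8e4nv unsupported: only fp8e4b15/fp8e5 on this GPU")

Under such a guard the fp8 examples would be skipped on this runner instead of erroring inside both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant.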
y_scale_ref = ref_fn() 2025-05-07T20:32:21.0176721Z 2025-05-07T20:32:21.0176833Z moe/activation_test.py:126: 2025-05-07T20:32:21.0177220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.0177555Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:21.0177885Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:21.0178719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:21.0179460Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:21.0180018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.0180705Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.0181443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:21.0182272Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.0183158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:21.0183909Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:21.0184644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:21.0185277Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:21.0185887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:21.0186410Z fn() 2025-05-07T20:32:21.0186917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:21.0187510Z self.fn.run( 2025-05-07T20:32:21.0187981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.0188519Z kernel = self.compile( 2025-05-07T20:32:21.0189171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.0189824Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.0190229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.0190460Z 2025-05-07T20:32:21.0190668Z self = 2025-05-07T20:32:21.0191746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.0193143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8e4bde40>} 2025-05-07T20:32:21.0194502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.0195528Z context = 2025-05-07T20:32:21.0195814Z 2025-05-07T20:32:21.0195980Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.0196504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.0196974Z module_map=module_map) 2025-05-07T20:32:21.0197358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.0197719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:21.0197998Z E ^ 2025-05-07T20:32:21.0198469Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.0198971Z 2025-05-07T20:32:21.0199456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.0199970Z 2025-05-07T20:32:21.0200079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.0200498Z self=, 2025-05-07T20:32:21.0200911Z T=1, 2025-05-07T20:32:21.0201103Z D=5120, 2025-05-07T20:32:21.0201297Z scale_ub=None, 2025-05-07T20:32:21.0201525Z contiguous=True, 2025-05-07T20:32:21.0201759Z compiled=False, 2025-05-07T20:32:21.0201968Z ) 2025-05-07T20:32:21.1353677Z self = 2025-05-07T20:32:21.1354422Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:21.1355202Z 2025-05-07T20:32:21.1355328Z @given( 2025-05-07T20:32:21.1355628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1356048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1356560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1356911Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1357254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1357544Z ) 2025-05-07T20:32:21.1357896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1358332Z def test_silu_mul_quant( 2025-05-07T20:32:21.1358578Z self, 2025-05-07T20:32:21.1358781Z T: int, 2025-05-07T20:32:21.1358976Z D: int, 2025-05-07T20:32:21.1359205Z scale_ub: Optional[float], 2025-05-07T20:32:21.1359478Z contiguous: bool, 2025-05-07T20:32:21.1359716Z compiled: bool, 2025-05-07T20:32:21.1359961Z ) -> None: 2025-05-07T20:32:21.1360187Z torch.manual_seed(2025) 2025-05-07T20:32:21.1360425Z 2025-05-07T20:32:21.1360709Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1361055Z 2025-05-07T20:32:21.1361253Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1361551Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1361867Z x = x_sign * x_clamp 2025-05-07T20:32:21.1362110Z x0 = x[:, :D] 2025-05-07T20:32:21.1362335Z x1 = x[:, D:] 2025-05-07T20:32:21.1362559Z 2025-05-07T20:32:21.1362758Z if contiguous: 2025-05-07T20:32:21.1362993Z x0 = x0.contiguous() 2025-05-07T20:32:21.1363256Z x1 = x1.contiguous() 2025-05-07T20:32:21.1363501Z 2025-05-07T20:32:21.1363693Z if scale_ub is not None: 2025-05-07T20:32:21.1363978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1364329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1364639Z ) 2025-05-07T20:32:21.1364837Z else: 2025-05-07T20:32:21.1365053Z scale_ub_tensor = None 2025-05-07T20:32:21.1365305Z 2025-05-07T20:32:21.1365542Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1365861Z op = silu_mul_quant 2025-05-07T20:32:21.1366112Z if compiled: 2025-05-07T20:32:21.1366364Z 
op = torch.compile(op) 2025-05-07T20:32:21.1366664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1366937Z 2025-05-07T20:32:21.1367137Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1367312Z 2025-05-07T20:32:21.1367418Z moe/activation_test.py:117: 2025-05-07T20:32:21.1367722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1368065Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1368402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1369091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1369781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1370318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1371091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1371756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1372281Z kernel = self.compile( 2025-05-07T20:32:21.1372824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1373479Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1373873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1374106Z 2025-05-07T20:32:21.1374361Z self = 2025-05-07T20:32:21.1375518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1376903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8e4bf880>} 2025-05-07T20:32:21.1378243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1379259Z context = 2025-05-07T20:32:21.1379552Z 2025-05-07T20:32:21.1379717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1380241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1380712Z module_map=module_map) 2025-05-07T20:32:21.1381076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1381440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1381701Z E ^ 2025-05-07T20:32:21.1382162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1382620Z 2025-05-07T20:32:21.1383036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1383553Z 2025-05-07T20:32:21.1383657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1384072Z self=, 2025-05-07T20:32:21.1384476Z T=128, 2025-05-07T20:32:21.1384677Z D=5120, 2025-05-07T20:32:21.1384878Z scale_ub=None, 2025-05-07T20:32:21.1385097Z contiguous=False, 2025-05-07T20:32:21.1385330Z compiled=True, 2025-05-07T20:32:21.1385545Z ) 2025-05-07T20:32:21.1385867Z self = 2025-05-07T20:32:21.1386362Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:21.1386635Z 2025-05-07T20:32:21.1386718Z @given( 2025-05-07T20:32:21.1386955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.1387268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.1387581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.1387920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.1388249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.1388559Z ) 2025-05-07T20:32:21.1388957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.1389504Z def test_silu_mul_quant( 2025-05-07T20:32:21.1389756Z self, 2025-05-07T20:32:21.1389959Z T: int, 2025-05-07T20:32:21.1390157Z D: int, 2025-05-07T20:32:21.1390380Z scale_ub: Optional[float], 2025-05-07T20:32:21.1390659Z contiguous: bool, 2025-05-07T20:32:21.1390962Z compiled: bool, 2025-05-07T20:32:21.1391185Z ) -> None: 2025-05-07T20:32:21.1391410Z torch.manual_seed(2025) 2025-05-07T20:32:21.1391653Z 2025-05-07T20:32:21.1391922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.1392267Z 2025-05-07T20:32:21.1392467Z x_sign = torch.sign(x) 2025-05-07T20:32:21.1392757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.1393069Z x = x_sign * x_clamp 2025-05-07T20:32:21.1393330Z x0 = x[:, :D] 2025-05-07T20:32:21.1393552Z x1 = x[:, D:] 2025-05-07T20:32:21.1393758Z 2025-05-07T20:32:21.1393947Z if contiguous: 2025-05-07T20:32:21.1394230Z x0 = x0.contiguous() 2025-05-07T20:32:21.1394530Z x1 = x1.contiguous() 2025-05-07T20:32:21.1394777Z 2025-05-07T20:32:21.1394977Z if scale_ub is not None: 2025-05-07T20:32:21.1395249Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.1395627Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.1395942Z ) 2025-05-07T20:32:21.1396137Z else: 2025-05-07T20:32:21.1396348Z scale_ub_tensor = None 2025-05-07T20:32:21.1396602Z 2025-05-07T20:32:21.1396833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.1397153Z op = silu_mul_quant 2025-05-07T20:32:21.1397410Z if compiled: 2025-05-07T20:32:21.1397657Z op = torch.compile(op) 2025-05-07T20:32:21.1397961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1398246Z 2025-05-07T20:32:21.1398470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.1398665Z 2025-05-07T20:32:21.1398769Z moe/activation_test.py:117: 2025-05-07T20:32:21.1399082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1399421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.1399705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.1400272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.1400835Z return fn(*args, **kwargs) 
2025-05-07T20:32:21.1401487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.1402177Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.1402717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.1403399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.1404057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.1404605Z kernel = self.compile( 2025-05-07T20:32:21.1405163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.1405832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.1406236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.1406478Z 2025-05-07T20:32:21.1406692Z self = 2025-05-07T20:32:21.1407782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.1409157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8e499c60>} 2025-05-07T20:32:21.1410551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.1411583Z context = 2025-05-07T20:32:21.1411880Z 2025-05-07T20:32:21.1412048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.1412572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.1413039Z module_map=module_map) 2025-05-07T20:32:21.1413409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.1413766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.1414025Z E ^ 2025-05-07T20:32:21.1414490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.1415064Z 2025-05-07T20:32:21.1415480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.1416025Z 2025-05-07T20:32:21.1416142Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.1416553Z self=, 2025-05-07T20:32:21.1416964Z T=128, 2025-05-07T20:32:21.1417163Z D=7168, 2025-05-07T20:32:21.1417358Z scale_ub=1200.0, 2025-05-07T20:32:21.1417591Z contiguous=False, 2025-05-07T20:32:21.1417824Z compiled=False, 2025-05-07T20:32:21.1418030Z ) 2025-05-07T20:32:21.2287937Z self = 2025-05-07T20:32:21.2288810Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.2289109Z 2025-05-07T20:32:21.2289189Z @given( 2025-05-07T20:32:21.2289450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.2289771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.2290075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.2290418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.2290752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.2291031Z ) 2025-05-07T20:32:21.2291387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.2298837Z def test_silu_mul_quant( 2025-05-07T20:32:21.2299163Z self, 2025-05-07T20:32:21.2299380Z T: int, 2025-05-07T20:32:21.2299580Z D: int, 2025-05-07T20:32:21.2299814Z scale_ub: Optional[float], 2025-05-07T20:32:21.2300104Z contiguous: bool, 2025-05-07T20:32:21.2300349Z compiled: bool, 2025-05-07T20:32:21.2300587Z ) -> None: 2025-05-07T20:32:21.2300813Z torch.manual_seed(2025) 2025-05-07T20:32:21.2301064Z 2025-05-07T20:32:21.2301352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.2301711Z 2025-05-07T20:32:21.2301908Z x_sign = torch.sign(x) 2025-05-07T20:32:21.2302212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.2302527Z x = x_sign * x_clamp 2025-05-07T20:32:21.2302775Z x0 = x[:, :D] 2025-05-07T20:32:21.2302995Z x1 = x[:, D:] 2025-05-07T20:32:21.2303212Z 2025-05-07T20:32:21.2303410Z if contiguous: 2025-05-07T20:32:21.2303642Z x0 = x0.contiguous() 2025-05-07T20:32:21.2303897Z x1 = x1.contiguous() 2025-05-07T20:32:21.2304141Z 2025-05-07T20:32:21.2304332Z if scale_ub is not None: 2025-05-07T20:32:21.2304614Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.2304954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.2305266Z ) 2025-05-07T20:32:21.2305471Z else: 2025-05-07T20:32:21.2305693Z scale_ub_tensor = None 2025-05-07T20:32:21.2305948Z 2025-05-07T20:32:21.2306189Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.2306509Z op = silu_mul_quant 2025-05-07T20:32:21.2306763Z if compiled: 2025-05-07T20:32:21.2307295Z op = torch.compile(op) 2025-05-07T20:32:21.2307604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.2307885Z 2025-05-07T20:32:21.2308079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.2308254Z 2025-05-07T20:32:21.2308358Z moe/activation_test.py:117: 2025-05-07T20:32:21.2308671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.2309002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.2309390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.2310095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.2310886Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efd8ea9c360>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(same Triton compilation traceback as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
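All of the failures above share one root cause: Triton only exposes the fp8e4nv type (the layout behind torch.float8_e4m3fn) on GPUs with compute capability 8.9 or newer (Ada, Hopper); a device whose supported set is only ('fp8e4b15', 'fp8e5') is an older part such as an sm_86 A10G. A minimal sketch of a capability guard that would skip these cases up front; the helper name and the class-level placement are assumptions for illustration, not the suite's actual structure:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv / torch.float8_e4m3fn kernels need SM 8.9+ (Ada, Hopper);
        # sm_86-class GPUs such as the A10G do not qualify.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...

Gating on the (major, minor) capability tuple rather than on a device-name string keeps the skip correct for both Ada (sm_89) and Hopper (sm_90) while excluding Ampere.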
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(same test source and Triton compilation traceback as above)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
(same failure; with compiled=True the call additionally passes through
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
before reaching activation.py:80)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
(same failure)
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
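To isolate the failure from Hypothesis and the FBGEMM wrappers, a minimal, self-contained sketch (a hypothetical kernel using only public Triton APIs, not FBGEMM code) that should hit the same ast_to_ttir-time ValueError on a pre-sm_89 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Merely referencing the fp8e4nv type in the kernel body should make
        # Triton's AST-to-TTIR lowering raise on unsupported architectures.
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError
    # wrapping the same ValueError seen in the log.
    _cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)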
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

For this example fn() itself succeeded, and the failure moved into the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, ..., backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7efca3aa1440>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
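The reference path fails identically because triton_quantize_fp8_row is itself a Triton kernel. Elementwise float8 casts, unlike fp8 Triton kernels, do work on older GPUs, so a plain-PyTorch row-wise quantization could stand in as the reference on such hardware. A sketch under that assumption; this is not FBGEMM's implementation, and the eps and clamping choices are illustrative:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row max-abs scaling into the e4m3fn range; returns fp8 values
        # and the per-row dequantization scales.
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = FP8_MAX / torch.clamp(row_max, min=1e-12)
        y_fp8 = (y.to(torch.float32) * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, (1.0 / scale).squeeze(-1)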
2025-05-07T20:32:21.8306534Z op = torch.compile(op) 2025-05-07T20:32:21.8306834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8307110Z 2025-05-07T20:32:21.8307300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.8307471Z 2025-05-07T20:32:21.8307576Z moe/activation_test.py:117: 2025-05-07T20:32:21.8307876Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8308230Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.8308529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8309263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:21.8309925Z return fn(*args, **kwargs) 2025-05-07T20:32:21.8310653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.8311333Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.8311873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8312562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8313223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8313750Z kernel = self.compile( 2025-05-07T20:32:21.8314295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8314957Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8315359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8315590Z 2025-05-07T20:32:21.8315803Z self = 2025-05-07T20:32:21.8316884Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8318266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa2a20>} 2025-05-07T20:32:21.8319610Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8320629Z context = 2025-05-07T20:32:21.8320922Z 2025-05-07T20:32:21.8321088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8321613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8322084Z module_map=module_map) 2025-05-07T20:32:21.8322449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8322803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.8323066Z E ^ 2025-05-07T20:32:21.8323530Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8323983Z 2025-05-07T20:32:21.8324396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8324917Z 2025-05-07T20:32:21.8325026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8325443Z self=, 2025-05-07T20:32:21.8325845Z T=1, 2025-05-07T20:32:21.8326041Z D=5120, 2025-05-07T20:32:21.8326250Z scale_ub=1200.0, 2025-05-07T20:32:21.8326557Z contiguous=False, 2025-05-07T20:32:21.8326795Z compiled=False, 2025-05-07T20:32:21.8327008Z ) 2025-05-07T20:32:21.8327324Z self = 2025-05-07T20:32:21.8327813Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:21.8328086Z 2025-05-07T20:32:21.8328432Z @given( 2025-05-07T20:32:21.8328688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:21.8328997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:21.8329304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:21.8329635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:21.8330024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:21.8330368Z ) 2025-05-07T20:32:21.8330723Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:21.8331230Z def test_silu_mul_quant( 2025-05-07T20:32:21.8331479Z self, 2025-05-07T20:32:21.8331683Z T: int, 2025-05-07T20:32:21.8331883Z D: int, 2025-05-07T20:32:21.8332106Z scale_ub: Optional[float], 2025-05-07T20:32:21.8332383Z contiguous: bool, 2025-05-07T20:32:21.8332632Z compiled: bool, 2025-05-07T20:32:21.8332853Z ) -> None: 2025-05-07T20:32:21.8333074Z torch.manual_seed(2025) 2025-05-07T20:32:21.8333323Z 2025-05-07T20:32:21.8333598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:21.8333942Z 2025-05-07T20:32:21.8334145Z x_sign = torch.sign(x) 2025-05-07T20:32:21.8334433Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:21.8334748Z x = x_sign * x_clamp 2025-05-07T20:32:21.8334994Z x0 = x[:, :D] 2025-05-07T20:32:21.8335210Z x1 = x[:, D:] 2025-05-07T20:32:21.8335423Z 2025-05-07T20:32:21.8335612Z if contiguous: 2025-05-07T20:32:21.8335843Z x0 = x0.contiguous() 2025-05-07T20:32:21.8336110Z x1 = x1.contiguous() 2025-05-07T20:32:21.8336351Z 2025-05-07T20:32:21.8336540Z if scale_ub is not None: 2025-05-07T20:32:21.8336815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:21.8337151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:21.8337465Z ) 2025-05-07T20:32:21.8337654Z else: 2025-05-07T20:32:21.8337867Z scale_ub_tensor = None 2025-05-07T20:32:21.8338122Z 2025-05-07T20:32:21.8338352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:21.8338697Z op = silu_mul_quant 2025-05-07T20:32:21.8338977Z if compiled: 2025-05-07T20:32:21.8339226Z op = torch.compile(op) 2025-05-07T20:32:21.8339528Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8339807Z 2025-05-07T20:32:21.8340000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:21.8340173Z 2025-05-07T20:32:21.8340276Z moe/activation_test.py:117: 2025-05-07T20:32:21.8340577Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8340904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:21.8341188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:21.8341873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:21.8342568Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:21.8343099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:21.8343782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:21.8344447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:21.8344984Z kernel = self.compile( 2025-05-07T20:32:21.8345588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:21.8346247Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:21.8346654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:21.8346881Z 2025-05-07T20:32:21.8347088Z self = 2025-05-07T20:32:21.8348168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:21.8349627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa31a0>} 2025-05-07T20:32:21.8351092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:21.8352118Z context = 2025-05-07T20:32:21.8352406Z 2025-05-07T20:32:21.8352575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:21.8353102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:21.8353577Z module_map=module_map) 2025-05-07T20:32:21.8353953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:21.8354308Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:21.8354579Z E ^ 2025-05-07T20:32:21.8355059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:21.8355507Z 2025-05-07T20:32:21.8355924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:21.8356452Z 2025-05-07T20:32:21.8356560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:21.8356979Z self=, 2025-05-07T20:32:21.8357390Z T=16384, 2025-05-07T20:32:21.8357592Z D=5120, 2025-05-07T20:32:21.8357801Z scale_ub=1200.0, 2025-05-07T20:32:21.8358041Z contiguous=False, 2025-05-07T20:32:21.8358300Z compiled=True, 2025-05-07T20:32:21.8358540Z ) 2025-05-07T20:32:22.0642590Z self = 2025-05-07T20:32:22.0643289Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0643679Z 2025-05-07T20:32:22.0643792Z @given( 2025-05-07T20:32:22.0644112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0644431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0644733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0645080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0645409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0645691Z ) 2025-05-07T20:32:22.0646044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0646484Z def test_silu_mul_quant( 2025-05-07T20:32:22.0646729Z self, 2025-05-07T20:32:22.0646933Z T: int, 2025-05-07T20:32:22.0647135Z D: int, 2025-05-07T20:32:22.0647353Z scale_ub: Optional[float], 2025-05-07T20:32:22.0647631Z contiguous: bool, 2025-05-07T20:32:22.0647873Z compiled: bool, 2025-05-07T20:32:22.0648138Z ) -> None: 2025-05-07T20:32:22.0648365Z torch.manual_seed(2025) 2025-05-07T20:32:22.0648620Z 2025-05-07T20:32:22.0648934Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0649284Z 2025-05-07T20:32:22.0649478Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0650061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0650376Z x = x_sign * x_clamp 2025-05-07T20:32:22.0650627Z x0 = x[:, :D] 2025-05-07T20:32:22.0650848Z x1 = x[:, D:] 2025-05-07T20:32:22.0651066Z 2025-05-07T20:32:22.0651262Z if contiguous: 2025-05-07T20:32:22.0651500Z x0 = x0.contiguous() 2025-05-07T20:32:22.0651764Z x1 = x1.contiguous() 2025-05-07T20:32:22.0652015Z 2025-05-07T20:32:22.0652210Z if scale_ub is not None: 2025-05-07T20:32:22.0652487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0652827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0653136Z ) 2025-05-07T20:32:22.0653420Z else: 2025-05-07T20:32:22.0653710Z scale_ub_tensor = None 2025-05-07T20:32:22.0653957Z 2025-05-07T20:32:22.0654194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0654516Z op = silu_mul_quant 2025-05-07T20:32:22.0654835Z if compiled: 2025-05-07T20:32:22.0655093Z op = torch.compile(op) 2025-05-07T20:32:22.0655394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0655677Z 2025-05-07T20:32:22.0655867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0656038Z 2025-05-07T20:32:22.0656140Z moe/activation_test.py:117: 2025-05-07T20:32:22.0656433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0656760Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0657047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0657607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0658159Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0658815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0659502Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0660035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0660710Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0661368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0661902Z kernel = self.compile( 2025-05-07T20:32:22.0662441Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0663086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0663496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0663732Z 2025-05-07T20:32:22.0663949Z self = 2025-05-07T20:32:22.0665025Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0666404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3308ea0>} 2025-05-07T20:32:22.0667742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0668810Z context = 2025-05-07T20:32:22.0669173Z 2025-05-07T20:32:22.0669347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0669857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0670371Z module_map=module_map) 2025-05-07T20:32:22.0670741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0671089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0671358Z E ^ 2025-05-07T20:32:22.0671822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0679062Z 2025-05-07T20:32:22.0679525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0680061Z 2025-05-07T20:32:22.0680169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.0680596Z self=, 2025-05-07T20:32:22.0681176Z T=2048, 2025-05-07T20:32:22.0681380Z D=7168, 2025-05-07T20:32:22.0681587Z scale_ub=1200.0, 2025-05-07T20:32:22.0681826Z contiguous=False, 2025-05-07T20:32:22.0682077Z compiled=True, 2025-05-07T20:32:22.0682381Z ) 2025-05-07T20:32:22.0682703Z self = 2025-05-07T20:32:22.0683204Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:22.0683478Z 2025-05-07T20:32:22.0683563Z @given( 2025-05-07T20:32:22.0683795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.0684114Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.0684429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.0684764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.0685092Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.0685381Z ) 2025-05-07T20:32:22.0685743Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.0686194Z def test_silu_mul_quant( 2025-05-07T20:32:22.0686446Z self, 2025-05-07T20:32:22.0686649Z T: int, 2025-05-07T20:32:22.0686852Z D: int, 2025-05-07T20:32:22.0687082Z scale_ub: Optional[float], 2025-05-07T20:32:22.0687362Z contiguous: bool, 2025-05-07T20:32:22.0687603Z compiled: bool, 2025-05-07T20:32:22.0687833Z ) -> None: 2025-05-07T20:32:22.0688054Z torch.manual_seed(2025) 2025-05-07T20:32:22.0688292Z 2025-05-07T20:32:22.0688574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.0688924Z 2025-05-07T20:32:22.0689131Z x_sign = torch.sign(x) 2025-05-07T20:32:22.0689423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.0689739Z x = x_sign * x_clamp 2025-05-07T20:32:22.0689991Z x0 = x[:, :D] 2025-05-07T20:32:22.0690214Z x1 = x[:, D:] 2025-05-07T20:32:22.0690430Z 2025-05-07T20:32:22.0690624Z if contiguous: 2025-05-07T20:32:22.0690856Z x0 = x0.contiguous() 2025-05-07T20:32:22.0691122Z x1 = x1.contiguous() 2025-05-07T20:32:22.0691365Z 2025-05-07T20:32:22.0691557Z if scale_ub is not None: 2025-05-07T20:32:22.0691839Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.0692181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.0692489Z ) 2025-05-07T20:32:22.0692690Z else: 2025-05-07T20:32:22.0692913Z scale_ub_tensor = None 2025-05-07T20:32:22.0693165Z 2025-05-07T20:32:22.0693402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.0693722Z op = silu_mul_quant 2025-05-07T20:32:22.0693979Z if compiled: 2025-05-07T20:32:22.0694230Z op = torch.compile(op) 2025-05-07T20:32:22.0694532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0694817Z 2025-05-07T20:32:22.0695010Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.0695181Z 2025-05-07T20:32:22.0695283Z moe/activation_test.py:117: 2025-05-07T20:32:22.0695591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0695970Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.0696259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.0696824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:22.0697392Z return fn(*args, **kwargs) 
2025-05-07T20:32:22.0698057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.0698767Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:22.0699350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:22.0700077Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:22.0700788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:22.0701329Z kernel = self.compile( 2025-05-07T20:32:22.0701929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:22.0702589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.0702999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.0703232Z 2025-05-07T20:32:22.0703453Z self = 2025-05-07T20:32:22.0704552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:22.0705944Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca33099e0>} 2025-05-07T20:32:22.0707317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:22.0708350Z context = 2025-05-07T20:32:22.0708640Z 2025-05-07T20:32:22.0708821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:22.0709483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.0709957Z module_map=module_map) 2025-05-07T20:32:22.0710329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.0710693Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.0710960Z E ^ 2025-05-07T20:32:22.0711440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.0711897Z 2025-05-07T20:32:22.0712335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.0712853Z 2025-05-07T20:32:22.1594370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:22.1594834Z self=, 2025-05-07T20:32:22.1595414Z T=1, 2025-05-07T20:32:22.1595680Z D=5120, 2025-05-07T20:32:22.1595939Z scale_ub=None, 2025-05-07T20:32:22.1596249Z contiguous=False, 2025-05-07T20:32:22.1596551Z compiled=False, 2025-05-07T20:32:22.1596820Z ) 2025-05-07T20:32:22.1597247Z self = 2025-05-07T20:32:22.1597821Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:22.1598104Z 2025-05-07T20:32:22.1598197Z @given( 2025-05-07T20:32:22.1598431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:22.1598748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:22.1599353Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:22.1599686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:22.1600023Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:22.1600318Z ) 2025-05-07T20:32:22.1600667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:22.1601112Z def test_silu_mul_quant( 2025-05-07T20:32:22.1601360Z self, 2025-05-07T20:32:22.1601557Z T: int, 2025-05-07T20:32:22.1601768Z D: int, 2025-05-07T20:32:22.1602000Z scale_ub: Optional[float], 2025-05-07T20:32:22.1602272Z contiguous: bool, 2025-05-07T20:32:22.1602565Z compiled: bool, 2025-05-07T20:32:22.1602876Z ) -> None: 2025-05-07T20:32:22.1603182Z torch.manual_seed(2025) 2025-05-07T20:32:22.1603431Z 2025-05-07T20:32:22.1603705Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:22.1604061Z 2025-05-07T20:32:22.1604337Z x_sign = torch.sign(x) 2025-05-07T20:32:22.1604646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:22.1604955Z x = x_sign * x_clamp 2025-05-07T20:32:22.1605203Z x0 = x[:, :D] 2025-05-07T20:32:22.1605431Z x1 = x[:, D:] 2025-05-07T20:32:22.1605635Z 2025-05-07T20:32:22.1605830Z if contiguous: 2025-05-07T20:32:22.1606073Z x0 = x0.contiguous() 2025-05-07T20:32:22.1606332Z x1 = x1.contiguous() 2025-05-07T20:32:22.1606584Z 2025-05-07T20:32:22.1606789Z if scale_ub is not None: 2025-05-07T20:32:22.1607066Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:22.1607419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:22.1607739Z ) 2025-05-07T20:32:22.1607935Z else: 2025-05-07T20:32:22.1608152Z scale_ub_tensor = None 2025-05-07T20:32:22.1608413Z 2025-05-07T20:32:22.1608643Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:22.1608966Z op = silu_mul_quant 2025-05-07T20:32:22.1609221Z if compiled: 2025-05-07T20:32:22.1609477Z op = torch.compile(op) 2025-05-07T20:32:22.1609774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1610049Z 2025-05-07T20:32:22.1610247Z > y_fp8, y_scale = fn() 2025-05-07T20:32:22.1610415Z 2025-05-07T20:32:22.1610519Z moe/activation_test.py:117: 2025-05-07T20:32:22.1610822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:22.1611157Z moe/activation_test.py:115: in fn 2025-05-07T20:32:22.1611436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:22.1612127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:22.1612828Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.1613369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.1614051Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.1614713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.1615244Z     kernel = self.compile(
2025-05-07T20:32:22.1615784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.1616443Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.1616847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2025-05-07T20:32:22.1617292Z self =
2025-05-07T20:32:22.1618431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.1619841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca330ad40>}
2025-05-07T20:32:22.1621183Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.1622201Z context =

2025-05-07T20:32:22.1622657Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.1623222Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.1623731Z                           module_map=module_map)
2025-05-07T20:32:22.1624136Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.1624488Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.1624758Z E       ^
2025-05-07T20:32:22.1625234Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:22.1626103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:22.1626719Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:22.1627136Z     self=,
2025-05-07T20:32:22.1627541Z     T=4096,
2025-05-07T20:32:22.1627733Z     D=7168,
2025-05-07T20:32:22.1627934Z     scale_ub=1200.0,
2025-05-07T20:32:22.1628472Z     contiguous=False,
2025-05-07T20:32:22.1628701Z     compiled=False,
2025-05-07T20:32:22.1628909Z )
2025-05-07T20:32:22.1629320Z self =
2025-05-07T20:32:22.1629830Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

2025-05-07T20:32:22.1630186Z     @given(
2025-05-07T20:32:22.1630419Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:22.1630732Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:22.1631035Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:22.1631366Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:22.1631699Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:22.1631979Z     )
2025-05-07T20:32:22.1632325Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:22.1632766Z     def test_silu_mul_quant(
2025-05-07T20:32:22.1633028Z         self,
2025-05-07T20:32:22.1633221Z         T: int,
2025-05-07T20:32:22.1633427Z         D: int,
2025-05-07T20:32:22.1633649Z         scale_ub: Optional[float],
2025-05-07T20:32:22.1633926Z         contiguous: bool,
2025-05-07T20:32:22.1634165Z         compiled: bool,
2025-05-07T20:32:22.1634391Z     ) -> None:
2025-05-07T20:32:22.1634613Z         torch.manual_seed(2025)

2025-05-07T20:32:22.1635128Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

2025-05-07T20:32:22.1635665Z         x_sign = torch.sign(x)
2025-05-07T20:32:22.1635959Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:22.1636278Z         x = x_sign * x_clamp
2025-05-07T20:32:22.1636515Z         x0 = x[:, :D]
2025-05-07T20:32:22.1636738Z         x1 = x[:, D:]

2025-05-07T20:32:22.1637139Z         if contiguous:
2025-05-07T20:32:22.1637376Z             x0 = x0.contiguous()
2025-05-07T20:32:22.1637644Z             x1 = x1.contiguous()

2025-05-07T20:32:22.1638073Z         if scale_ub is not None:
2025-05-07T20:32:22.1638345Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:22.1638758Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:22.1639070Z             )
2025-05-07T20:32:22.1639266Z         else:
2025-05-07T20:32:22.1639480Z             scale_ub_tensor = None

2025-05-07T20:32:22.1639968Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:22.1640285Z             op = silu_mul_quant
2025-05-07T20:32:22.1640534Z             if compiled:
2025-05-07T20:32:22.1640785Z                 op = torch.compile(op)
2025-05-07T20:32:22.1641133Z             return op(x0, x1, scale_ub_tensor)

2025-05-07T20:32:22.1641809Z >       y_fp8, y_scale = fn()

2025-05-07T20:32:22.1642279Z moe/activation_test.py:117:
2025-05-07T20:32:22.1642768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:22.1643105Z moe/activation_test.py:115: in fn
2025-05-07T20:32:22.1643394Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:22.1644147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:22.1644830Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:22.1645365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:22.1646055Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:22.1646709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:22.1647241Z     kernel = self.compile(
2025-05-07T20:32:22.1647784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:22.1648487Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:22.1648905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

2025-05-07T20:32:22.1649351Z self =
2025-05-07T20:32:22.1650435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:22.1651800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca330ba60>}
2025-05-07T20:32:22.1653131Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:22.1654165Z context =

2025-05-07T20:32:22.1654635Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:22.1655162Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:22.1655626Z                           module_map=module_map)
2025-05-07T20:32:22.1655997Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:22.1656361Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:22.1656629Z E       ^
2025-05-07T20:32:22.1657089Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:22.1657968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
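Every Hypothesis example fails at the same spot: Triton refuses to lower the kernel because this GPU cannot represent the fp8e4nv (FP8 E4M3) dtype. A minimal diagnostic sketch, assuming the usual mapping of fp8e4nv to torch.float8_e4m3fn and Triton's compute-capability >= 8.9 requirement for it (the A10G in a g5.4xlarge runner reports (8, 6)):

    # Diagnostic sketch (not part of the test suite). Assumes Triton enables
    # fp8e4nv (FP8 E4M3, i.e. torch.float8_e4m3fn) only on GPUs with compute
    # capability >= 8.9; the A10G here reports (8, 6), which would explain the
    # CompilationError above.
    import torch

    def supports_fp8e4nv() -> bool:
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(), torch.cuda.get_device_capability())
        print("fp8e4nv expected to compile:", supports_fp8e4nv())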
2025-05-07T20:32:22.1658595Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:22.3053229Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.3054711Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.3092354Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.4226290Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.4257277Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.4258758Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.4289480Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.5165444Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:22.5195705Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
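Because the failure is architecture-dependent rather than input-dependent, every example Hypothesis tries fails identically. One way to avoid burning CI time on such runners would be to skip the FP8 path up front; a sketch, where the capability check and decorator placement are illustrative and not the actual FBGEMM test code:

    # Hypothetical guard, not FBGEMM's actual code: skip the property-based
    # test wholesale when the GPU cannot compile fp8e4nv kernels.
    import unittest
    import torch

    def _fp8e4nv_unsupported() -> bool:
        return not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(_fp8e4nv_unsupported(), "fp8e4nv needs compute capability >= 8.9")
        def test_silu_mul_quant(self) -> None:
            ...  # @given/@settings body as shown in the log above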
2025-05-07T20:32:22.5197259Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:22.5226871Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.5228648Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:22.8736416Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.8738085Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:22.8767637Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
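The error message itself names the only FP8 formats this architecture accepts, fp8e4b15 and fp8e5. Where only the storage format matters, a wrapper could fall back to E5M2 on older parts; a sketch with a hypothetical helper (silu_mul_quant appears to target E4M3 here, so this is illustrative, not a drop-in fix):

    # Hypothetical dtype selection, not an FBGEMM API: prefer E4M3 (fp8e4nv)
    # where supported, otherwise E5M2 (fp8e5), which the message above says
    # this sm_86 part accepts.
    import torch

    def pick_fp8_dtype() -> torch.dtype:
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2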
2025-05-07T20:32:22.8769143Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.9485021Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:22.9486513Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:22.9517191Z E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:23.0777137Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:23.0819222Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:23.0819603Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:23.0819866Z E       ^
2025-05-07T20:32:23.0820329Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0820784Z 2025-05-07T20:32:23.0821208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0821720Z 2025-05-07T20:32:23.0821835Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.0822248Z self=, 2025-05-07T20:32:23.0822662Z T=16384, 2025-05-07T20:32:23.0822869Z D=5120, 2025-05-07T20:32:23.0823074Z scale_ub=1200.0, 2025-05-07T20:32:23.0823304Z contiguous=True, 2025-05-07T20:32:23.0823532Z compiled=True, 2025-05-07T20:32:23.0823747Z ) 2025-05-07T20:32:23.0824068Z self = 2025-05-07T20:32:23.0824572Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.0824854Z 2025-05-07T20:32:23.0824942Z @given( 2025-05-07T20:32:23.0825173Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.0825496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.0825808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.0826140Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.0826476Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.0826780Z ) 2025-05-07T20:32:23.0827133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.0827577Z def test_silu_mul_quant( 2025-05-07T20:32:23.0827827Z self, 2025-05-07T20:32:23.0828086Z T: int, 2025-05-07T20:32:23.0828583Z D: int, 2025-05-07T20:32:23.0828808Z scale_ub: Optional[float], 2025-05-07T20:32:23.0829128Z contiguous: bool, 2025-05-07T20:32:23.0829375Z compiled: bool, 2025-05-07T20:32:23.0829602Z ) -> None: 2025-05-07T20:32:23.0829823Z torch.manual_seed(2025) 2025-05-07T20:32:23.0830063Z 2025-05-07T20:32:23.0830344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.0830689Z 2025-05-07T20:32:23.0830883Z x_sign = torch.sign(x) 2025-05-07T20:32:23.0831188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.0831579Z x = x_sign * x_clamp 2025-05-07T20:32:23.0831885Z x0 = x[:, :D] 2025-05-07T20:32:23.0832100Z x1 = x[:, D:] 2025-05-07T20:32:23.0832312Z 2025-05-07T20:32:23.0832501Z if contiguous: 2025-05-07T20:32:23.0832731Z x0 = x0.contiguous() 2025-05-07T20:32:23.0833056Z x1 = x1.contiguous() 2025-05-07T20:32:23.0833301Z 2025-05-07T20:32:23.0833493Z if scale_ub is not None: 2025-05-07T20:32:23.0833768Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.0834108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.0834413Z ) 2025-05-07T20:32:23.0834616Z else: 2025-05-07T20:32:23.0834832Z scale_ub_tensor = None 2025-05-07T20:32:23.0835081Z 2025-05-07T20:32:23.0835322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.0835648Z op = silu_mul_quant 2025-05-07T20:32:23.0835901Z if compiled: 2025-05-07T20:32:23.0836159Z op = torch.compile(op) 2025-05-07T20:32:23.0836471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0836742Z 2025-05-07T20:32:23.0836941Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.0837112Z 2025-05-07T20:32:23.0837215Z moe/activation_test.py:117: 2025-05-07T20:32:23.0837525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0837862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.0838149Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.0838745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.0839331Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.0840000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.0840710Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.0841260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.0841956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.0842638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.0843186Z kernel = self.compile( 2025-05-07T20:32:23.0843733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.0844403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.0844820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.0845053Z 2025-05-07T20:32:23.0845272Z self = 2025-05-07T20:32:23.0846362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.0847757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d11c0>} 2025-05-07T20:32:23.0849193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.0850291Z context = 2025-05-07T20:32:23.0850587Z 2025-05-07T20:32:23.0850765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.0851287Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.0851768Z module_map=module_map) 2025-05-07T20:32:23.0852183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.0852579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.0852844Z E ^ 2025-05-07T20:32:23.0853361Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.0853824Z 2025-05-07T20:32:23.0854245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.0854773Z 2025-05-07T20:32:23.3813136Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.3814039Z self=, 2025-05-07T20:32:23.3814852Z T=16384, 2025-05-07T20:32:23.3815241Z D=5120, 2025-05-07T20:32:23.3815634Z scale_ub=None, 2025-05-07T20:32:23.3816074Z contiguous=False, 2025-05-07T20:32:23.3816519Z compiled=True, 2025-05-07T20:32:23.3816936Z ) 2025-05-07T20:32:23.3817586Z self = 2025-05-07T20:32:23.3818599Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.3819151Z 2025-05-07T20:32:23.3819235Z @given( 2025-05-07T20:32:23.3819519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.3819858Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.3820167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.3820505Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.3820837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.3821127Z ) 2025-05-07T20:32:23.3821487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.3821930Z def test_silu_mul_quant( 2025-05-07T20:32:23.3822174Z self, 2025-05-07T20:32:23.3822379Z T: int, 2025-05-07T20:32:23.3822583Z D: int, 2025-05-07T20:32:23.3822806Z scale_ub: Optional[float], 2025-05-07T20:32:23.3823087Z contiguous: bool, 2025-05-07T20:32:23.3823336Z compiled: bool, 2025-05-07T20:32:23.3823564Z ) -> None: 2025-05-07T20:32:23.3823785Z torch.manual_seed(2025) 2025-05-07T20:32:23.3824029Z 2025-05-07T20:32:23.3824303Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.3824653Z 2025-05-07T20:32:23.3824853Z x_sign = torch.sign(x) 2025-05-07T20:32:23.3825144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.3825450Z x = x_sign * x_clamp 2025-05-07T20:32:23.3825694Z x0 = x[:, :D] 2025-05-07T20:32:23.3825919Z x1 = x[:, D:] 2025-05-07T20:32:23.3826125Z 2025-05-07T20:32:23.3826319Z if contiguous: 2025-05-07T20:32:23.3826558Z x0 = x0.contiguous() 2025-05-07T20:32:23.3826814Z x1 = x1.contiguous() 2025-05-07T20:32:23.3827054Z 2025-05-07T20:32:23.3827258Z if scale_ub is not None: 2025-05-07T20:32:23.3827528Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.3827873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.3828455Z ) 2025-05-07T20:32:23.3828650Z else: 2025-05-07T20:32:23.3828864Z scale_ub_tensor = None 2025-05-07T20:32:23.3829172Z 2025-05-07T20:32:23.3829677Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.3830000Z op = silu_mul_quant 2025-05-07T20:32:23.3830255Z if compiled: 2025-05-07T20:32:23.3830521Z op = torch.compile(op) 2025-05-07T20:32:23.3830821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.3831097Z 2025-05-07T20:32:23.3831291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.3831463Z 2025-05-07T20:32:23.3831567Z moe/activation_test.py:117: 2025-05-07T20:32:23.3831864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.3832195Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.3832559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.3833201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.3833767Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.3834520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.3835216Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.3835753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.3836430Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.3837090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.3837623Z kernel = self.compile( 2025-05-07T20:32:23.3838163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.3838818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.3839225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.3839495Z 2025-05-07T20:32:23.3839721Z self = 2025-05-07T20:32:23.3840793Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.3842162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1d00>} 2025-05-07T20:32:23.3843495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.3844522Z context = 2025-05-07T20:32:23.3844807Z 2025-05-07T20:32:23.3844982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.3845494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.3845966Z module_map=module_map) 2025-05-07T20:32:23.3846329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.3846684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.3846938Z E ^ 2025-05-07T20:32:23.3847404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.3847854Z 2025-05-07T20:32:23.3848279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.3848794Z 2025-05-07T20:32:23.3848914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.3849361Z self=, 2025-05-07T20:32:23.3849765Z T=2048, 2025-05-07T20:32:23.3849960Z D=5120, 2025-05-07T20:32:23.3850201Z scale_ub=None, 2025-05-07T20:32:23.3850425Z contiguous=False, 2025-05-07T20:32:23.3850653Z compiled=True, 2025-05-07T20:32:23.3850853Z ) 2025-05-07T20:32:23.4568763Z self = 2025-05-07T20:32:23.4569431Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:23.4569819Z 2025-05-07T20:32:23.4569922Z @given( 2025-05-07T20:32:23.4570225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.4570627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.4571027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.4571628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.4572047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.4572332Z ) 2025-05-07T20:32:23.4572682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.4573205Z def test_silu_mul_quant( 2025-05-07T20:32:23.4573453Z self, 2025-05-07T20:32:23.4573657Z T: int, 2025-05-07T20:32:23.4573866Z D: int, 2025-05-07T20:32:23.4574083Z scale_ub: Optional[float], 2025-05-07T20:32:23.4574362Z contiguous: bool, 2025-05-07T20:32:23.4574610Z compiled: bool, 2025-05-07T20:32:23.4574838Z ) -> None: 2025-05-07T20:32:23.4575061Z torch.manual_seed(2025) 2025-05-07T20:32:23.4575310Z 2025-05-07T20:32:23.4575586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.4575930Z 2025-05-07T20:32:23.4576133Z x_sign = torch.sign(x) 2025-05-07T20:32:23.4576423Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.4576740Z x = x_sign * x_clamp 2025-05-07T20:32:23.4576984Z x0 = x[:, :D] 2025-05-07T20:32:23.4577204Z x1 = x[:, D:] 2025-05-07T20:32:23.4577408Z 2025-05-07T20:32:23.4577604Z if contiguous: 2025-05-07T20:32:23.4577845Z x0 = x0.contiguous() 2025-05-07T20:32:23.4578103Z x1 = x1.contiguous() 2025-05-07T20:32:23.4578345Z 2025-05-07T20:32:23.4578542Z if scale_ub is not None: 2025-05-07T20:32:23.4578809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.4579148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.4579487Z ) 2025-05-07T20:32:23.4579703Z else: 2025-05-07T20:32:23.4579919Z scale_ub_tensor = None 2025-05-07T20:32:23.4580172Z 2025-05-07T20:32:23.4580401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.4580717Z op = silu_mul_quant 2025-05-07T20:32:23.4580974Z if compiled: 2025-05-07T20:32:23.4581220Z op = torch.compile(op) 2025-05-07T20:32:23.4581519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4581796Z 2025-05-07T20:32:23.4581990Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.4582164Z 2025-05-07T20:32:23.4582268Z moe/activation_test.py:117: 2025-05-07T20:32:23.4582562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4582899Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.4583177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4583738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.4584302Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.4584957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.4585645Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.4586182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4586865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4587607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4588144Z kernel = self.compile( 2025-05-07T20:32:23.4588693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4589443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4589834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4590069Z 2025-05-07T20:32:23.4590278Z self = 2025-05-07T20:32:23.4591358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4592867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1620>} 2025-05-07T20:32:23.4594206Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4595239Z context = 2025-05-07T20:32:23.4595535Z 2025-05-07T20:32:23.4595701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4596234Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4596695Z module_map=module_map) 2025-05-07T20:32:23.4597072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4597430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.4597688Z E ^ 2025-05-07T20:32:23.4598160Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.4598613Z 2025-05-07T20:32:23.4599028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.4599537Z 2025-05-07T20:32:23.4599648Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.4600055Z self=, 2025-05-07T20:32:23.4600458Z T=2048, 2025-05-07T20:32:23.4600657Z D=5120, 2025-05-07T20:32:23.4600849Z scale_ub=1200.0, 2025-05-07T20:32:23.4601079Z contiguous=False, 2025-05-07T20:32:23.4601314Z compiled=True, 2025-05-07T20:32:23.4601520Z ) 2025-05-07T20:32:23.4601845Z self = 2025-05-07T20:32:23.4602341Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.4602612Z 2025-05-07T20:32:23.4602696Z @given( 2025-05-07T20:32:23.4602937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.4603257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.4603565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.4603890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.4604218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.4604503Z ) 2025-05-07T20:32:23.4604849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.4605294Z def test_silu_mul_quant( 2025-05-07T20:32:23.4605539Z self, 2025-05-07T20:32:23.4605739Z T: int, 2025-05-07T20:32:23.4605936Z D: int, 2025-05-07T20:32:23.4606163Z scale_ub: Optional[float], 2025-05-07T20:32:23.4606441Z contiguous: bool, 2025-05-07T20:32:23.4606678Z compiled: bool, 2025-05-07T20:32:23.4606908Z ) -> None: 2025-05-07T20:32:23.4607127Z torch.manual_seed(2025) 2025-05-07T20:32:23.4607370Z 2025-05-07T20:32:23.4607704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.4608055Z 2025-05-07T20:32:23.4608255Z x_sign = torch.sign(x) 2025-05-07T20:32:23.4608558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.4608877Z x = x_sign * x_clamp 2025-05-07T20:32:23.4609119Z x0 = x[:, :D] 2025-05-07T20:32:23.4609348Z x1 = x[:, D:] 2025-05-07T20:32:23.4609568Z 2025-05-07T20:32:23.4609760Z if contiguous: 2025-05-07T20:32:23.4610005Z x0 = x0.contiguous() 2025-05-07T20:32:23.4610278Z x1 = x1.contiguous() 2025-05-07T20:32:23.4610530Z 2025-05-07T20:32:23.4610767Z if scale_ub is not None: 2025-05-07T20:32:23.4611086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.4611427Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.4611733Z ) 2025-05-07T20:32:23.4611933Z else: 2025-05-07T20:32:23.4612199Z scale_ub_tensor = None 2025-05-07T20:32:23.4612448Z 2025-05-07T20:32:23.4612688Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.4613005Z op = silu_mul_quant 2025-05-07T20:32:23.4613252Z if compiled: 2025-05-07T20:32:23.4613506Z op = torch.compile(op) 2025-05-07T20:32:23.4613807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4614077Z 2025-05-07T20:32:23.4614275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.4614437Z 2025-05-07T20:32:23.4614548Z moe/activation_test.py:117: 2025-05-07T20:32:23.4614846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4615180Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.4615474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.4616032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.4616586Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.4617251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.4617943Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.4618479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.4619153Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.4619817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.4620401Z kernel = self.compile( 2025-05-07T20:32:23.4620939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.4621599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.4622004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.4622233Z 2025-05-07T20:32:23.4622447Z self = 2025-05-07T20:32:23.4623518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.4624878Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca34905e0>} 2025-05-07T20:32:23.4626219Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.4627272Z context = 2025-05-07T20:32:23.4627566Z 2025-05-07T20:32:23.4627790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.4628625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.4629142Z module_map=module_map) 2025-05-07T20:32:23.4629518Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.4629873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.4630144Z E ^ 2025-05-07T20:32:23.4630616Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.4637735Z 2025-05-07T20:32:23.4638195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.4638982Z 2025-05-07T20:32:23.5956420Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5958054Z self=, 2025-05-07T20:32:23.5958879Z T=4096, 2025-05-07T20:32:23.5959082Z D=5120, 2025-05-07T20:32:23.5959280Z scale_ub=1200.0, 2025-05-07T20:32:23.5959517Z contiguous=True, 2025-05-07T20:32:23.5959746Z compiled=True, 2025-05-07T20:32:23.5959960Z ) 2025-05-07T20:32:23.5960291Z self = 2025-05-07T20:32:23.5960787Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.5961057Z 2025-05-07T20:32:23.5961149Z @given( 2025-05-07T20:32:23.5961383Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.5961707Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.5962015Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.5962356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.5962692Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.5962982Z ) 2025-05-07T20:32:23.5963343Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.5963788Z def test_silu_mul_quant( 2025-05-07T20:32:23.5964040Z self, 2025-05-07T20:32:23.5964245Z T: int, 2025-05-07T20:32:23.5964454Z D: int, 2025-05-07T20:32:23.5964691Z scale_ub: Optional[float], 2025-05-07T20:32:23.5964967Z contiguous: bool, 2025-05-07T20:32:23.5965224Z compiled: bool, 2025-05-07T20:32:23.5965468Z ) -> None: 2025-05-07T20:32:23.5965699Z torch.manual_seed(2025) 2025-05-07T20:32:23.5965945Z 2025-05-07T20:32:23.5966234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.5966586Z 2025-05-07T20:32:23.5966792Z x_sign = torch.sign(x) 2025-05-07T20:32:23.5967106Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.5967426Z x = x_sign * x_clamp 2025-05-07T20:32:23.5967669Z x0 = x[:, :D] 2025-05-07T20:32:23.5967897Z x1 = x[:, D:] 2025-05-07T20:32:23.5968120Z 2025-05-07T20:32:23.5968312Z if contiguous: 2025-05-07T20:32:23.5968555Z x0 = x0.contiguous() 2025-05-07T20:32:23.5968824Z x1 = x1.contiguous() 2025-05-07T20:32:23.5969064Z 2025-05-07T20:32:23.5969269Z if scale_ub is not None: 2025-05-07T20:32:23.5969546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.5969882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.5970198Z ) 2025-05-07T20:32:23.5970402Z else: 2025-05-07T20:32:23.5970620Z scale_ub_tensor = None 2025-05-07T20:32:23.5970871Z 2025-05-07T20:32:23.5971110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.5971439Z op = silu_mul_quant 2025-05-07T20:32:23.5971693Z if compiled: 2025-05-07T20:32:23.5971949Z op = torch.compile(op) 2025-05-07T20:32:23.5972246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5972517Z 2025-05-07T20:32:23.5972832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.5972998Z 2025-05-07T20:32:23.5973109Z moe/activation_test.py:117: 2025-05-07T20:32:23.5973400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5973736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.5974021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.5974584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.5975141Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.5975809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.5976667Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.5977199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.5977933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.5978607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.5979143Z kernel = self.compile( 2025-05-07T20:32:23.5979700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.5980403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.5980811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.5981041Z 2025-05-07T20:32:23.5981254Z self = 2025-05-07T20:32:23.5982337Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.5983741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3491120>} 2025-05-07T20:32:23.5985089Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.5986125Z context = 2025-05-07T20:32:23.5986420Z 2025-05-07T20:32:23.5986587Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.5987107Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.5987587Z module_map=module_map) 2025-05-07T20:32:23.5987965Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.5988320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.5988593Z E ^ 2025-05-07T20:32:23.5989182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.5989663Z 2025-05-07T20:32:23.5990110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.5990629Z 2025-05-07T20:32:23.5990738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.5991162Z self=, 2025-05-07T20:32:23.5991576Z T=128, 2025-05-07T20:32:23.5991766Z D=5120, 2025-05-07T20:32:23.5991970Z scale_ub=1200.0, 2025-05-07T20:32:23.5992207Z contiguous=False, 2025-05-07T20:32:23.5992434Z compiled=True, 2025-05-07T20:32:23.5992655Z ) 2025-05-07T20:32:23.8503872Z self = 2025-05-07T20:32:23.8504653Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:23.8505042Z 2025-05-07T20:32:23.8505437Z @given( 2025-05-07T20:32:23.8505700Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.8506019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.8506323Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.8506661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.8506996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.8507286Z ) 2025-05-07T20:32:23.8507634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.8508083Z def test_silu_mul_quant( 2025-05-07T20:32:23.8508338Z self, 2025-05-07T20:32:23.8508628Z T: int, 2025-05-07T20:32:23.8508920Z D: int, 2025-05-07T20:32:23.8509248Z scale_ub: Optional[float], 2025-05-07T20:32:23.8509546Z contiguous: bool, 2025-05-07T20:32:23.8509814Z compiled: bool, 2025-05-07T20:32:23.8510050Z ) -> None: 2025-05-07T20:32:23.8510349Z torch.manual_seed(2025) 2025-05-07T20:32:23.8510594Z 2025-05-07T20:32:23.8510874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.8511211Z 2025-05-07T20:32:23.8511415Z x_sign = torch.sign(x) 2025-05-07T20:32:23.8511711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.8512032Z x = x_sign * x_clamp 2025-05-07T20:32:23.8512272Z x0 = x[:, :D] 2025-05-07T20:32:23.8512492Z x1 = x[:, D:] 2025-05-07T20:32:23.8512705Z 2025-05-07T20:32:23.8512891Z if contiguous: 2025-05-07T20:32:23.8513127Z x0 = x0.contiguous() 2025-05-07T20:32:23.8513392Z x1 = x1.contiguous() 2025-05-07T20:32:23.8513630Z 2025-05-07T20:32:23.8513830Z if scale_ub is not None: 2025-05-07T20:32:23.8514109Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.8514437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.8514758Z ) 2025-05-07T20:32:23.8514957Z else: 2025-05-07T20:32:23.8515164Z scale_ub_tensor = None 2025-05-07T20:32:23.8515416Z 2025-05-07T20:32:23.8515656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.8515963Z op = silu_mul_quant 2025-05-07T20:32:23.8516223Z if compiled: 2025-05-07T20:32:23.8516473Z op = torch.compile(op) 2025-05-07T20:32:23.8516767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8517045Z 2025-05-07T20:32:23.8517242Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.8517404Z 2025-05-07T20:32:23.8517512Z moe/activation_test.py:117: 2025-05-07T20:32:23.8517803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8518149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.8518432Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8518990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.8519606Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.8520264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.8520956Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.8521486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.8522176Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.8522839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.8523370Z kernel = self.compile( 2025-05-07T20:32:23.8523916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.8524585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.8525048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8525278Z 2025-05-07T20:32:23.8525489Z self = 2025-05-07T20:32:23.8526566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.8527955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3492340>} 2025-05-07T20:32:23.8529664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.8530811Z context = 2025-05-07T20:32:23.8531104Z 2025-05-07T20:32:23.8531272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.8531797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.8532266Z module_map=module_map) 2025-05-07T20:32:23.8532631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.8532985Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.8533246Z E ^ 2025-05-07T20:32:23.8533715Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.8534162Z 2025-05-07T20:32:23.8534577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.8535093Z 2025-05-07T20:32:23.8535198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.8535615Z self=, 2025-05-07T20:32:23.8536010Z T=16384, 2025-05-07T20:32:23.8536210Z D=7168, 2025-05-07T20:32:23.8536411Z scale_ub=1200.0, 2025-05-07T20:32:23.8536636Z contiguous=True, 2025-05-07T20:32:23.8536853Z compiled=True, 2025-05-07T20:32:23.8537084Z ) 2025-05-07T20:32:23.8537407Z self = 2025-05-07T20:32:23.8537904Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:23.8538178Z 2025-05-07T20:32:23.8538259Z @given( 2025-05-07T20:32:23.8538503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.8538826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.8539130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.8539472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.8539847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.8540133Z ) 2025-05-07T20:32:23.8540491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.8540934Z def test_silu_mul_quant( 2025-05-07T20:32:23.8541185Z self, 2025-05-07T20:32:23.8541376Z T: int, 2025-05-07T20:32:23.8541583Z D: int, 2025-05-07T20:32:23.8541810Z scale_ub: Optional[float], 2025-05-07T20:32:23.8542077Z contiguous: bool, 2025-05-07T20:32:23.8542321Z compiled: bool, 2025-05-07T20:32:23.8542551Z ) -> None: 2025-05-07T20:32:23.8542765Z torch.manual_seed(2025) 2025-05-07T20:32:23.8543006Z 2025-05-07T20:32:23.8543280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.8543621Z 2025-05-07T20:32:23.8543824Z x_sign = torch.sign(x) 2025-05-07T20:32:23.8544123Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.8544427Z x = x_sign * x_clamp 2025-05-07T20:32:23.8544676Z x0 = x[:, :D] 2025-05-07T20:32:23.8544980Z x1 = x[:, D:] 2025-05-07T20:32:23.8545190Z 2025-05-07T20:32:23.8545380Z if contiguous: 2025-05-07T20:32:23.8545622Z x0 = x0.contiguous() 2025-05-07T20:32:23.8545877Z x1 = x1.contiguous() 2025-05-07T20:32:23.8546126Z 2025-05-07T20:32:23.8546324Z if scale_ub is not None: 2025-05-07T20:32:23.8546596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.8546931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.8547242Z ) 2025-05-07T20:32:23.8547436Z else: 2025-05-07T20:32:23.8547645Z scale_ub_tensor = None 2025-05-07T20:32:23.8547904Z 2025-05-07T20:32:23.8548235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.8548591Z op = silu_mul_quant 2025-05-07T20:32:23.8548850Z if compiled: 2025-05-07T20:32:23.8549152Z op = torch.compile(op) 2025-05-07T20:32:23.8549484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8549800Z 2025-05-07T20:32:23.8550026Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.8550190Z 2025-05-07T20:32:23.8550293Z moe/activation_test.py:117: 2025-05-07T20:32:23.8550603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8550949Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.8551245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.8551807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:23.8552373Z return fn(*args, **kwargs) 
2025-05-07T20:32:23.8553038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.8553727Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.8554277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.8554967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.8555642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.8556169Z kernel = self.compile( 2025-05-07T20:32:23.8556713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.8557370Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.8557770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.8558010Z 2025-05-07T20:32:23.8558220Z self = 2025-05-07T20:32:23.8559366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.8560741Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3493c40>} 2025-05-07T20:32:23.8562090Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.8563115Z context = 2025-05-07T20:32:23.8563413Z 2025-05-07T20:32:23.8563582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.8564102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.8564577Z module_map=module_map) 2025-05-07T20:32:23.8564946Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.8565355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.8565623Z E ^ 2025-05-07T20:32:23.8566081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.8566535Z 2025-05-07T20:32:23.8566948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.8567462Z 2025-05-07T20:32:23.9526868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9527483Z self=, 2025-05-07T20:32:23.9528103Z T=16384, 2025-05-07T20:32:23.9528654Z D=5120, 2025-05-07T20:32:23.9529175Z scale_ub=1200.0, 2025-05-07T20:32:23.9529524Z contiguous=True, 2025-05-07T20:32:23.9529749Z compiled=False, 2025-05-07T20:32:23.9529956Z ) 2025-05-07T20:32:23.9530277Z self = 2025-05-07T20:32:23.9530863Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:23.9531144Z 2025-05-07T20:32:23.9531231Z @given( 2025-05-07T20:32:23.9531462Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9531778Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9532086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9532415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9532749Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9533038Z ) 2025-05-07T20:32:23.9533381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9533829Z def test_silu_mul_quant( 2025-05-07T20:32:23.9534087Z self, 2025-05-07T20:32:23.9534288Z T: int, 2025-05-07T20:32:23.9534497Z D: int, 2025-05-07T20:32:23.9534722Z scale_ub: Optional[float], 2025-05-07T20:32:23.9534999Z contiguous: bool, 2025-05-07T20:32:23.9535245Z compiled: bool, 2025-05-07T20:32:23.9535485Z ) -> None: 2025-05-07T20:32:23.9535710Z torch.manual_seed(2025) 2025-05-07T20:32:23.9535952Z 2025-05-07T20:32:23.9536228Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9536577Z 2025-05-07T20:32:23.9536774Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9537068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9537386Z x = x_sign * x_clamp 2025-05-07T20:32:23.9537629Z x0 = x[:, :D] 2025-05-07T20:32:23.9537848Z x1 = x[:, D:] 2025-05-07T20:32:23.9538060Z 2025-05-07T20:32:23.9538248Z if contiguous: 2025-05-07T20:32:23.9538487Z x0 = x0.contiguous() 2025-05-07T20:32:23.9538754Z x1 = x1.contiguous() 2025-05-07T20:32:23.9538991Z 2025-05-07T20:32:23.9539189Z if scale_ub is not None: 2025-05-07T20:32:23.9539464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9539808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9540117Z ) 2025-05-07T20:32:23.9540319Z else: 2025-05-07T20:32:23.9540540Z scale_ub_tensor = None 2025-05-07T20:32:23.9540789Z 2025-05-07T20:32:23.9541027Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9541347Z op = silu_mul_quant 2025-05-07T20:32:23.9541596Z if compiled: 2025-05-07T20:32:23.9541847Z op = torch.compile(op) 2025-05-07T20:32:23.9542148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9542424Z 2025-05-07T20:32:23.9542629Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9542799Z 2025-05-07T20:32:23.9542916Z moe/activation_test.py:117: 2025-05-07T20:32:23.9543209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9543545Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9543828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9544601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:23.9545285Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9545822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9546507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9547159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9547695Z kernel = self.compile( 2025-05-07T20:32:23.9548241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9549017Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9549503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9549781Z 2025-05-07T20:32:23.9549997Z self = 2025-05-07T20:32:23.9551080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9552474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d00c20>} 2025-05-07T20:32:23.9553826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9554869Z context = 2025-05-07T20:32:23.9555164Z 2025-05-07T20:32:23.9555337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9555863Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9556326Z module_map=module_map) 2025-05-07T20:32:23.9556699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9557058Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9557322Z E ^ 2025-05-07T20:32:23.9557783Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9558236Z 2025-05-07T20:32:23.9558649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9559187Z 2025-05-07T20:32:23.9559312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.9559737Z self=, 2025-05-07T20:32:23.9560135Z T=1, 2025-05-07T20:32:23.9560327Z D=7168, 2025-05-07T20:32:23.9560529Z scale_ub=1200.0, 2025-05-07T20:32:23.9560753Z contiguous=False, 2025-05-07T20:32:23.9560983Z compiled=False, 2025-05-07T20:32:23.9561195Z ) 2025-05-07T20:32:23.9561510Z self = 2025-05-07T20:32:23.9561996Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:23.9562259Z 2025-05-07T20:32:23.9562345Z @given( 2025-05-07T20:32:23.9562579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.9562893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.9563201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.9563531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.9563866Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.9564152Z ) 2025-05-07T20:32:23.9564507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.9564993Z def test_silu_mul_quant( 2025-05-07T20:32:23.9565245Z self, 2025-05-07T20:32:23.9565446Z T: int, 2025-05-07T20:32:23.9565646Z D: int, 2025-05-07T20:32:23.9565867Z scale_ub: Optional[float], 2025-05-07T20:32:23.9566140Z contiguous: bool, 2025-05-07T20:32:23.9566382Z compiled: bool, 2025-05-07T20:32:23.9566607Z ) -> None: 2025-05-07T20:32:23.9566829Z torch.manual_seed(2025) 2025-05-07T20:32:23.9567065Z 2025-05-07T20:32:23.9567350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.9567692Z 2025-05-07T20:32:23.9567884Z x_sign = torch.sign(x) 2025-05-07T20:32:23.9568228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.9568605Z x = x_sign * x_clamp 2025-05-07T20:32:23.9568853Z x0 = x[:, :D] 2025-05-07T20:32:23.9569070Z x1 = x[:, D:] 2025-05-07T20:32:23.9569283Z 2025-05-07T20:32:23.9569525Z if contiguous: 2025-05-07T20:32:23.9569792Z x0 = x0.contiguous() 2025-05-07T20:32:23.9577212Z x1 = x1.contiguous() 2025-05-07T20:32:23.9577481Z 2025-05-07T20:32:23.9577678Z if scale_ub is not None: 2025-05-07T20:32:23.9577965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.9578322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.9578638Z ) 2025-05-07T20:32:23.9578834Z else: 2025-05-07T20:32:23.9579052Z scale_ub_tensor = None 2025-05-07T20:32:23.9579319Z 2025-05-07T20:32:23.9579585Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.9579934Z op = silu_mul_quant 2025-05-07T20:32:23.9580205Z if compiled: 2025-05-07T20:32:23.9580453Z op = torch.compile(op) 2025-05-07T20:32:23.9580760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9581044Z 2025-05-07T20:32:23.9581238Z > y_fp8, y_scale = fn() 2025-05-07T20:32:23.9581412Z 2025-05-07T20:32:23.9581515Z moe/activation_test.py:117: 2025-05-07T20:32:23.9581821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9582173Z moe/activation_test.py:115: in fn 2025-05-07T20:32:23.9582451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.9583159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:23.9583860Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:23.9584396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.9585088Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.9585762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.9586299Z kernel = self.compile( 2025-05-07T20:32:23.9586845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.9587504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.9587911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.9588141Z 2025-05-07T20:32:23.9588351Z self = 2025-05-07T20:32:23.9589538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.9590987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d01120>} 2025-05-07T20:32:23.9592459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.9593497Z context = 2025-05-07T20:32:23.9593791Z 2025-05-07T20:32:23.9593959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.9594485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.9594962Z module_map=module_map) 2025-05-07T20:32:23.9595341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.9595692Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:23.9596015Z E ^ 2025-05-07T20:32:23.9596537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.9596989Z 2025-05-07T20:32:23.9597455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:23.9597981Z 2025-05-07T20:32:24.0932823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.0933418Z self=, 2025-05-07T20:32:24.0934022Z T=4096, 2025-05-07T20:32:24.0934293Z D=7168, 2025-05-07T20:32:24.0934565Z scale_ub=1200.0, 2025-05-07T20:32:24.0934880Z contiguous=False, 2025-05-07T20:32:24.0935118Z compiled=True, 2025-05-07T20:32:24.0935337Z ) 2025-05-07T20:32:24.0935704Z self = 2025-05-07T20:32:24.0936273Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:24.0936569Z 2025-05-07T20:32:24.0936651Z @given( 2025-05-07T20:32:24.0936897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.0937216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.0937539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.0937872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.0938206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.0938504Z ) 2025-05-07T20:32:24.0938856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.0939307Z def test_silu_mul_quant( 2025-05-07T20:32:24.0939560Z self, 2025-05-07T20:32:24.0939785Z T: int, 2025-05-07T20:32:24.0940016Z D: int, 2025-05-07T20:32:24.0940245Z scale_ub: Optional[float], 2025-05-07T20:32:24.0940516Z contiguous: bool, 2025-05-07T20:32:24.0940766Z compiled: bool, 2025-05-07T20:32:24.0941004Z ) -> None: 2025-05-07T20:32:24.0941226Z torch.manual_seed(2025) 2025-05-07T20:32:24.0941477Z 2025-05-07T20:32:24.0941763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.0942114Z 2025-05-07T20:32:24.0942314Z x_sign = torch.sign(x) 2025-05-07T20:32:24.0942616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.0942933Z x = x_sign * x_clamp 2025-05-07T20:32:24.0943175Z x0 = x[:, :D] 2025-05-07T20:32:24.0943404Z x1 = x[:, D:] 2025-05-07T20:32:24.0943622Z 2025-05-07T20:32:24.0943808Z if contiguous: 2025-05-07T20:32:24.0944047Z x0 = x0.contiguous() 2025-05-07T20:32:24.0944312Z x1 = x1.contiguous() 2025-05-07T20:32:24.0944549Z 2025-05-07T20:32:24.0944750Z if scale_ub is not None: 2025-05-07T20:32:24.0945028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.0945362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.0945681Z ) 2025-05-07T20:32:24.0945892Z else: 2025-05-07T20:32:24.0946107Z scale_ub_tensor = None 2025-05-07T20:32:24.0946365Z 2025-05-07T20:32:24.0946608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.0947216Z op = silu_mul_quant 2025-05-07T20:32:24.0947472Z if compiled: 2025-05-07T20:32:24.0947727Z op = torch.compile(op) 2025-05-07T20:32:24.0948030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0948301Z 2025-05-07T20:32:24.0948500Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.0948665Z 2025-05-07T20:32:24.0948776Z moe/activation_test.py:117: 2025-05-07T20:32:24.0949169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0949537Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.0949851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.0950409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.0951140Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.0951876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.0952571Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.0953106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.0953792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.0954460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.0955000Z kernel = self.compile( 2025-05-07T20:32:24.0955538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.0956195Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0956603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.0956832Z 2025-05-07T20:32:24.0957044Z self = 2025-05-07T20:32:24.0958130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.0959512Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2d02f20>} 2025-05-07T20:32:24.0960858Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.0961874Z context = 2025-05-07T20:32:24.0962170Z 2025-05-07T20:32:24.0962336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.0962859Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0963335Z module_map=module_map) 2025-05-07T20:32:24.0963696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0964065Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0964337Z E ^ 2025-05-07T20:32:24.0964796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0965252Z 2025-05-07T20:32:24.0965667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis retries condensed below: each example re-prints the identical test source and traceback shown in full above; only the sampled parameters and the failing line differ.]

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
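Every example that reaches the Triton kernel fails with this same ValueError: fp8e4nv is Triton's name for the FP8 E4M3 format, and Triton's NVIDIA backend only lowers it on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job's linux.g5.4xlarge runner carries an NVIDIA A10G, which is sm_86, so the check fails before any code is generated. A minimal sketch of a guard one could put in front of such tests follows; the helper and class names are hypothetical, and FBGEMM may already gate this differently elsewhere.

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton maps fp8e4nv to FP8 E4M3, which its
        # CUDA backend accepts only on compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not gpu_supports_fp8e4nv(),
        "Triton fp8e4nv requires sm_89+; older GPUs raise CompilationError",
    )
    class SiluMulQuantFP8Test(unittest.TestCase):
        ...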
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (moe/activation_test.py:95: OutOfMemoryError)
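From this point on, most failures are not the fp8 problem at all: the device is nearly exhausted (roughly 22 GiB of the 22.07 GiB capacity already in use), so even the 40-448 MiB tensors the test allocates cannot be placed. The error text itself suggests the allocator setting sketched below. This is a minimal sketch, assuming the variable is set before torch first initializes CUDA; the cleanup helper between Hypothesis examples is an assumption, not something the test currently does.

    import gc
    import os

    # Must be set before the first CUDA allocation for the allocator to honor it.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()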
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:95: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:94: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
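The requested sizes are exactly what a [T, 2*D] bfloat16 tensor occupies, and each failing statement (torch.randn at line 92, torch.sign at 94, torch.abs/torch.clamp at 95) materializes one more full-size copy of x, which is why successive examples die at different lines depending on how full the device already is. A quick arithmetic check:

    def x_mib(T: int, D: int) -> float:
        # x has shape [T, 2 * D] in bfloat16 (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    # x_mib(16384, 7168) == 448.0  -> "Tried to allocate 448.00 MiB"
    # x_mib(16384, 5120) == 320.0  -> 320.00 MiB
    # x_mib(4096, 7168)  == 112.0  -> 112.00 MiB
    # x_mib(4096, 5120)  ==  80.0  ->  80.00 MiB
    # x_mib(2048, 7168)  ==  56.0  ->  56.00 MiB
    # x_mib(2048, 5120)  ==  40.0  ->  40.00 MiB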
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (moe/activation_test.py:94: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. (moe/activation_test.py:92: OutOfMemoryError)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.7135280Z 2025-05-07T20:32:24.7135400Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.7135617Z 2025-05-07T20:32:24.7135723Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7136145Z self=, 2025-05-07T20:32:24.7136549Z T=2048, 2025-05-07T20:32:24.7136743Z D=5120, 2025-05-07T20:32:24.7136942Z scale_ub=1200.0, 2025-05-07T20:32:24.7137175Z contiguous=False, 2025-05-07T20:32:24.7137417Z compiled=False, 2025-05-07T20:32:24.7137634Z ) 2025-05-07T20:32:24.7137970Z self = 2025-05-07T20:32:24.7138469Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.7138756Z 2025-05-07T20:32:24.7138839Z @given( 2025-05-07T20:32:24.7139088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.7139456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.7139777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.7140118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.7140455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.7140751Z ) 2025-05-07T20:32:24.7141108Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.7141560Z def test_silu_mul_quant( 2025-05-07T20:32:24.7141808Z self, 2025-05-07T20:32:24.7142103Z T: int, 2025-05-07T20:32:24.7142313Z D: int, 2025-05-07T20:32:24.7142536Z scale_ub: Optional[float], 2025-05-07T20:32:24.7142817Z contiguous: bool, 2025-05-07T20:32:24.7143076Z compiled: bool, 2025-05-07T20:32:24.7143301Z ) -> None: 2025-05-07T20:32:24.7143526Z torch.manual_seed(2025) 2025-05-07T20:32:24.7143773Z 2025-05-07T20:32:24.7144045Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.7146143Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.7148089Z 2025-05-07T20:32:24.7148210Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.7148430Z 2025-05-07T20:32:24.7148537Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.7148953Z self=, 2025-05-07T20:32:24.7149402Z T=4096, 2025-05-07T20:32:24.7149597Z D=7168, 2025-05-07T20:32:24.7149798Z scale_ub=1200.0, 2025-05-07T20:32:24.7150037Z contiguous=True, 2025-05-07T20:32:24.7150314Z compiled=False, 2025-05-07T20:32:24.7150524Z ) 2025-05-07T20:32:24.8076262Z self = 2025-05-07T20:32:24.8077000Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8077332Z 2025-05-07T20:32:24.8077442Z @given( 2025-05-07T20:32:24.8077692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8078021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8078337Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8078669Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8079005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8079329Z ) 2025-05-07T20:32:24.8079761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8080322Z def test_silu_mul_quant( 2025-05-07T20:32:24.8080636Z self, 2025-05-07T20:32:24.8080881Z T: int, 2025-05-07T20:32:24.8081135Z D: int, 2025-05-07T20:32:24.8081417Z scale_ub: Optional[float], 2025-05-07T20:32:24.8081757Z contiguous: bool, 2025-05-07T20:32:24.8082006Z compiled: bool, 2025-05-07T20:32:24.8082240Z ) -> None: 2025-05-07T20:32:24.8082458Z torch.manual_seed(2025) 2025-05-07T20:32:24.8082710Z 2025-05-07T20:32:24.8082998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8085048Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8086893Z 2025-05-07T20:32:24.8087023Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8087237Z 2025-05-07T20:32:24.8087346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8087762Z self=, 2025-05-07T20:32:24.8088167Z T=16384, 2025-05-07T20:32:24.8088362Z D=7168, 2025-05-07T20:32:24.8088746Z scale_ub=None, 2025-05-07T20:32:24.8088990Z contiguous=False, 2025-05-07T20:32:24.8089268Z compiled=True, 2025-05-07T20:32:24.8089532Z ) 2025-05-07T20:32:24.8089935Z self = 2025-05-07T20:32:24.8090555Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:24.8090897Z 2025-05-07T20:32:24.8090997Z @given( 2025-05-07T20:32:24.8091242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8091561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8091860Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8092275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8092683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8092962Z ) 2025-05-07T20:32:24.8093314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8093833Z def test_silu_mul_quant( 2025-05-07T20:32:24.8094082Z self, 2025-05-07T20:32:24.8094278Z T: int, 2025-05-07T20:32:24.8094482Z D: int, 2025-05-07T20:32:24.8094703Z scale_ub: Optional[float], 2025-05-07T20:32:24.8094977Z contiguous: bool, 2025-05-07T20:32:24.8095222Z compiled: bool, 2025-05-07T20:32:24.8095446Z ) -> None: 2025-05-07T20:32:24.8095661Z torch.manual_seed(2025) 2025-05-07T20:32:24.8095911Z 2025-05-07T20:32:24.8096192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8098229Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8100071Z 2025-05-07T20:32:24.8100194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8100418Z 2025-05-07T20:32:24.8100523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8100935Z self=, 2025-05-07T20:32:24.8101339Z T=4096, 2025-05-07T20:32:24.8101534Z D=7168, 2025-05-07T20:32:24.8101740Z scale_ub=None, 2025-05-07T20:32:24.8101960Z contiguous=True, 2025-05-07T20:32:24.8102182Z compiled=False, 2025-05-07T20:32:24.8102405Z ) 2025-05-07T20:32:24.8102732Z self = 2025-05-07T20:32:24.8103221Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8103500Z 2025-05-07T20:32:24.8103589Z @given( 2025-05-07T20:32:24.8103829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8104138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8104451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8104788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8105125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8105405Z ) 2025-05-07T20:32:24.8105761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8106203Z def test_silu_mul_quant( 2025-05-07T20:32:24.8106445Z self, 2025-05-07T20:32:24.8106650Z T: int, 2025-05-07T20:32:24.8106856Z D: int, 2025-05-07T20:32:24.8107075Z scale_ub: Optional[float], 2025-05-07T20:32:24.8107354Z contiguous: bool, 2025-05-07T20:32:24.8107608Z compiled: bool, 2025-05-07T20:32:24.8107830Z ) -> None: 2025-05-07T20:32:24.8108051Z torch.manual_seed(2025) 2025-05-07T20:32:24.8108299Z 2025-05-07T20:32:24.8108622Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8111168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8113189Z 2025-05-07T20:32:24.8113350Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8113567Z 2025-05-07T20:32:24.8113672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8114124Z self=, 2025-05-07T20:32:24.8114527Z T=16384, 2025-05-07T20:32:24.8114731Z D=7168, 2025-05-07T20:32:24.8114930Z scale_ub=None, 2025-05-07T20:32:24.8115147Z contiguous=True, 2025-05-07T20:32:24.8115378Z compiled=False, 2025-05-07T20:32:24.8115587Z ) 2025-05-07T20:32:24.8115904Z self = 2025-05-07T20:32:24.8116397Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:24.8116678Z 2025-05-07T20:32:24.8116760Z @given( 2025-05-07T20:32:24.8124084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8124454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8124804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8125189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8125569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8125897Z ) 2025-05-07T20:32:24.8126301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8126830Z def test_silu_mul_quant( 2025-05-07T20:32:24.8127100Z self, 2025-05-07T20:32:24.8127306Z T: int, 2025-05-07T20:32:24.8127519Z D: int, 2025-05-07T20:32:24.8127757Z scale_ub: Optional[float], 2025-05-07T20:32:24.8128056Z contiguous: bool, 2025-05-07T20:32:24.8128585Z compiled: bool, 2025-05-07T20:32:24.8128818Z ) -> None: 2025-05-07T20:32:24.8129040Z torch.manual_seed(2025) 2025-05-07T20:32:24.8129295Z 2025-05-07T20:32:24.8129587Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8131682Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8133574Z 2025-05-07T20:32:24.8133711Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8133929Z 2025-05-07T20:32:24.8134035Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8134461Z self=, 2025-05-07T20:32:24.8134878Z T=16384, 2025-05-07T20:32:24.8135073Z D=7168, 2025-05-07T20:32:24.8135284Z scale_ub=1200.0, 2025-05-07T20:32:24.8135516Z contiguous=True, 2025-05-07T20:32:24.8135743Z compiled=False, 2025-05-07T20:32:24.8135958Z ) 2025-05-07T20:32:24.8136289Z self = 2025-05-07T20:32:24.8136796Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.8137083Z 2025-05-07T20:32:24.8137281Z @given( 2025-05-07T20:32:24.8137525Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.8137846Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.8138155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.8138495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.8138837Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.8139186Z ) 2025-05-07T20:32:24.8139630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.8140193Z def test_silu_mul_quant( 2025-05-07T20:32:24.8140493Z self, 2025-05-07T20:32:24.8140830Z T: int, 2025-05-07T20:32:24.8141188Z D: int, 2025-05-07T20:32:24.8141471Z scale_ub: Optional[float], 2025-05-07T20:32:24.8141790Z contiguous: bool, 2025-05-07T20:32:24.8142036Z compiled: bool, 2025-05-07T20:32:24.8142319Z ) -> None: 2025-05-07T20:32:24.8142541Z torch.manual_seed(2025) 2025-05-07T20:32:24.8142795Z 2025-05-07T20:32:24.8143077Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.8145164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.8147060Z 2025-05-07T20:32:24.8147185Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.8147408Z 2025-05-07T20:32:24.8147516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.8147949Z self=, 2025-05-07T20:32:24.8148365Z T=128, 2025-05-07T20:32:24.8148561Z D=5120, 2025-05-07T20:32:24.8148772Z scale_ub=1200.0, 2025-05-07T20:32:24.8149020Z contiguous=False, 2025-05-07T20:32:24.8149311Z compiled=False, 2025-05-07T20:32:24.8149526Z ) 2025-05-07T20:32:24.9163760Z self = 2025-05-07T20:32:24.9164319Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:24.9164702Z 2025-05-07T20:32:24.9164790Z @given( 2025-05-07T20:32:24.9165032Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9165366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9165682Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9166025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9166371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9166660Z ) 2025-05-07T20:32:24.9167025Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9167472Z def test_silu_mul_quant( 2025-05-07T20:32:24.9167718Z self, 2025-05-07T20:32:24.9167927Z T: int, 2025-05-07T20:32:24.9168137Z D: int, 2025-05-07T20:32:24.9168361Z scale_ub: Optional[float], 2025-05-07T20:32:24.9168631Z contiguous: bool, 2025-05-07T20:32:24.9168881Z compiled: bool, 2025-05-07T20:32:24.9169111Z ) -> None: 2025-05-07T20:32:24.9169326Z torch.manual_seed(2025) 2025-05-07T20:32:24.9169582Z 2025-05-07T20:32:24.9169861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9170209Z 2025-05-07T20:32:24.9170414Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9170711Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9171025Z x = x_sign * x_clamp 2025-05-07T20:32:24.9171273Z x0 = x[:, :D] 2025-05-07T20:32:24.9171771Z x1 = x[:, D:] 2025-05-07T20:32:24.9171980Z 2025-05-07T20:32:24.9172175Z if contiguous: 2025-05-07T20:32:24.9172416Z x0 = x0.contiguous() 2025-05-07T20:32:24.9172674Z x1 = x1.contiguous() 2025-05-07T20:32:24.9172924Z 2025-05-07T20:32:24.9173126Z if scale_ub is not None: 2025-05-07T20:32:24.9173401Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9173749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9174067Z ) 2025-05-07T20:32:24.9174266Z else: 2025-05-07T20:32:24.9174480Z scale_ub_tensor = None 2025-05-07T20:32:24.9174740Z 2025-05-07T20:32:24.9175075Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9175488Z op = silu_mul_quant 2025-05-07T20:32:24.9175744Z if compiled: 2025-05-07T20:32:24.9175999Z op = torch.compile(op) 2025-05-07T20:32:24.9176383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9176667Z 2025-05-07T20:32:24.9176868Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9177032Z 2025-05-07T20:32:24.9177136Z moe/activation_test.py:117: 2025-05-07T20:32:24.9177458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9177800Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9178090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9178787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9179473Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9180023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9180723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9181384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9181925Z kernel = self.compile( 2025-05-07T20:32:24.9182476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9183141Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9183537Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9183772Z 2025-05-07T20:32:24.9183983Z self = 2025-05-07T20:32:24.9185067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9186473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290a5c0>} 2025-05-07T20:32:24.9187821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9188833Z context = 2025-05-07T20:32:24.9189233Z 2025-05-07T20:32:24.9189402Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9189924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9190393Z module_map=module_map) 2025-05-07T20:32:24.9190755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9191117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9191383Z E ^ 2025-05-07T20:32:24.9191902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9192458Z 2025-05-07T20:32:24.9192961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9193593Z 2025-05-07T20:32:24.9193706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9194177Z self=, 2025-05-07T20:32:24.9194636Z T=2048, 2025-05-07T20:32:24.9194844Z D=7168, 2025-05-07T20:32:24.9195050Z scale_ub=None, 2025-05-07T20:32:24.9195279Z contiguous=False, 2025-05-07T20:32:24.9195523Z compiled=False, 2025-05-07T20:32:24.9195743Z ) 2025-05-07T20:32:24.9196141Z self = 2025-05-07T20:32:24.9196755Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:24.9197080Z 2025-05-07T20:32:24.9197161Z @given( 2025-05-07T20:32:24.9197447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9197796Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9198140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9198515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9198882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9199206Z ) 2025-05-07T20:32:24.9199661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9200181Z def test_silu_mul_quant( 2025-05-07T20:32:24.9200437Z self, 2025-05-07T20:32:24.9200644Z T: int, 2025-05-07T20:32:24.9200853Z D: int, 2025-05-07T20:32:24.9201079Z scale_ub: Optional[float], 2025-05-07T20:32:24.9201385Z contiguous: bool, 2025-05-07T20:32:24.9201649Z compiled: bool, 2025-05-07T20:32:24.9201883Z ) -> None: 2025-05-07T20:32:24.9202112Z torch.manual_seed(2025) 2025-05-07T20:32:24.9202381Z 2025-05-07T20:32:24.9202680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9205273Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
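The CompilationError above is an architecture limitation rather than a bug in the kernel: Triton's fp8e4nv is the FP8 E4M3 format, which appears to require compute capability (8, 9) or newer, and the roughly 22 GiB GPU on this runner evidently reports something older. A hedged sketch of a capability guard; the helper name and skip message are illustrative, not part of the test file:

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) is generally available from SM 8.9
        # (Ada/Hopper) onward; older parts trigger the ValueError seen above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Possible use on the affected tests:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")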
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9207632Z 2025-05-07T20:32:24.9207763Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:24.9208016Z 2025-05-07T20:32:24.9208126Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9208601Z self=, 2025-05-07T20:32:24.9209065Z T=128, 2025-05-07T20:32:24.9209272Z D=7168, 2025-05-07T20:32:24.9209484Z scale_ub=1200.0, 2025-05-07T20:32:24.9209721Z contiguous=True, 2025-05-07T20:32:24.9209991Z compiled=True, 2025-05-07T20:32:24.9210231Z ) 2025-05-07T20:32:24.9512983Z self = 2025-05-07T20:32:24.9513487Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.9513800Z 2025-05-07T20:32:24.9513910Z @given( 2025-05-07T20:32:24.9514244Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9514667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9515084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9515487Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9515821Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9516110Z ) 2025-05-07T20:32:24.9516471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9517060Z def test_silu_mul_quant( 2025-05-07T20:32:24.9517308Z self, 2025-05-07T20:32:24.9517532Z T: int, 2025-05-07T20:32:24.9517734Z D: int, 2025-05-07T20:32:24.9517953Z scale_ub: Optional[float], 2025-05-07T20:32:24.9518218Z contiguous: bool, 2025-05-07T20:32:24.9518470Z compiled: bool, 2025-05-07T20:32:24.9518698Z ) -> None: 2025-05-07T20:32:24.9518926Z torch.manual_seed(2025) 2025-05-07T20:32:24.9519163Z 2025-05-07T20:32:24.9519442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9519785Z 2025-05-07T20:32:24.9519981Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9520344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9520727Z x = x_sign * x_clamp 2025-05-07T20:32:24.9520970Z x0 = x[:, :D] 2025-05-07T20:32:24.9521185Z x1 = x[:, D:] 2025-05-07T20:32:24.9521397Z 2025-05-07T20:32:24.9521648Z if contiguous: 2025-05-07T20:32:24.9521882Z x0 = x0.contiguous() 2025-05-07T20:32:24.9522151Z x1 = x1.contiguous() 2025-05-07T20:32:24.9522394Z 2025-05-07T20:32:24.9522586Z if scale_ub is not None: 2025-05-07T20:32:24.9522864Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.9523204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.9523507Z ) 2025-05-07T20:32:24.9523705Z else: 2025-05-07T20:32:24.9523919Z scale_ub_tensor = None 2025-05-07T20:32:24.9524166Z 2025-05-07T20:32:24.9524402Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.9524720Z op = silu_mul_quant 2025-05-07T20:32:24.9524972Z if compiled: 2025-05-07T20:32:24.9525225Z op = torch.compile(op) 2025-05-07T20:32:24.9525520Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9525795Z 2025-05-07T20:32:24.9525986Z > y_fp8, y_scale = fn() 2025-05-07T20:32:24.9526157Z 2025-05-07T20:32:24.9526261Z moe/activation_test.py:117: 2025-05-07T20:32:24.9526561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9526889Z moe/activation_test.py:115: in fn 2025-05-07T20:32:24.9527173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.9527736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:24.9528590Z return fn(*args, **kwargs) 
2025-05-07T20:32:24.9529248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:24.9529988Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:24.9530523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.9531200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.9531860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.9532399Z kernel = self.compile( 2025-05-07T20:32:24.9532940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.9533588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.9533986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.9534213Z 2025-05-07T20:32:24.9534428Z self = 2025-05-07T20:32:24.9535501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.9536978Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290aac0>} 2025-05-07T20:32:24.9538328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.9539353Z context = 2025-05-07T20:32:24.9539645Z 2025-05-07T20:32:24.9539834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.9540380Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.9540945Z module_map=module_map) 2025-05-07T20:32:24.9541381Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.9541733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.9542004Z E ^ 2025-05-07T20:32:24.9542534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.9542983Z 2025-05-07T20:32:24.9543407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.9543916Z 2025-05-07T20:32:24.9544021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9544438Z self=, 2025-05-07T20:32:24.9544843Z T=128, 2025-05-07T20:32:24.9545041Z D=7168, 2025-05-07T20:32:24.9545240Z scale_ub=1200.0, 2025-05-07T20:32:24.9545469Z contiguous=True, 2025-05-07T20:32:24.9545699Z compiled=False, 2025-05-07T20:32:24.9545912Z ) 2025-05-07T20:32:24.9546237Z self = 2025-05-07T20:32:24.9546726Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:24.9546994Z 2025-05-07T20:32:24.9547077Z @given( 2025-05-07T20:32:24.9547315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9547630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9547932Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9548262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9548593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9548880Z ) 2025-05-07T20:32:24.9549339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9549783Z def test_silu_mul_quant( 2025-05-07T20:32:24.9550027Z self, 2025-05-07T20:32:24.9550222Z T: int, 2025-05-07T20:32:24.9550427Z D: int, 2025-05-07T20:32:24.9550655Z scale_ub: Optional[float], 2025-05-07T20:32:24.9550923Z contiguous: bool, 2025-05-07T20:32:24.9551165Z compiled: bool, 2025-05-07T20:32:24.9551386Z ) -> None: 2025-05-07T20:32:24.9551602Z torch.manual_seed(2025) 2025-05-07T20:32:24.9551849Z 2025-05-07T20:32:24.9552126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9552465Z 2025-05-07T20:32:24.9552665Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9552961Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9554955Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9556791Z 2025-05-07T20:32:24.9556920Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.9557193Z 2025-05-07T20:32:24.9557302Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9557714Z self=, 2025-05-07T20:32:24.9558116Z T=128, 2025-05-07T20:32:24.9558305Z D=5120, 2025-05-07T20:32:24.9558503Z scale_ub=1200.0, 2025-05-07T20:32:24.9558728Z contiguous=True, 2025-05-07T20:32:24.9558947Z compiled=True, 2025-05-07T20:32:24.9559154Z ) 2025-05-07T20:32:24.9559502Z self = 2025-05-07T20:32:24.9560020Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:24.9560285Z 2025-05-07T20:32:24.9560414Z @given( 2025-05-07T20:32:24.9560686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.9561001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.9561305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.9561682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.9562018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.9562298Z ) 2025-05-07T20:32:24.9562650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.9563092Z def test_silu_mul_quant( 2025-05-07T20:32:24.9563331Z self, 2025-05-07T20:32:24.9563534Z T: int, 2025-05-07T20:32:24.9563737Z D: int, 2025-05-07T20:32:24.9563958Z scale_ub: Optional[float], 2025-05-07T20:32:24.9564226Z contiguous: bool, 2025-05-07T20:32:24.9564472Z compiled: bool, 2025-05-07T20:32:24.9564701Z ) -> None: 2025-05-07T20:32:24.9564917Z torch.manual_seed(2025) 2025-05-07T20:32:24.9565167Z 2025-05-07T20:32:24.9565449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.9565789Z 2025-05-07T20:32:24.9565990Z x_sign = torch.sign(x) 2025-05-07T20:32:24.9566285Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.9568263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
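Note the trend in these reports: the earlier failures had 26.44 MiB free and this one only 4.44 MiB, so the allocator is holding progressively more memory as examples run. The figures quoted in the message are also available programmatically; a small illustrative sketch:

    import torch

    # Mirrors the "allocated by PyTorch" / "reserved but unallocated" numbers
    # quoted in the OutOfMemoryError text.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"allocated: {allocated:.2f} GiB, reserved-but-unallocated: {reserved - allocated:.2f} GiB")
    print(torch.cuda.memory_summary(abbreviated=True))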
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:24.9570153Z 2025-05-07T20:32:24.9570275Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:24.9570498Z 2025-05-07T20:32:24.9570608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.9571017Z self=, 2025-05-07T20:32:24.9571420Z T=128, 2025-05-07T20:32:24.9571615Z D=7168, 2025-05-07T20:32:24.9571817Z scale_ub=None, 2025-05-07T20:32:24.9572036Z contiguous=True, 2025-05-07T20:32:24.9572258Z compiled=True, 2025-05-07T20:32:24.9572465Z ) 2025-05-07T20:32:25.1476849Z self = 2025-05-07T20:32:25.1477349Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1477619Z 2025-05-07T20:32:25.1477700Z @given( 2025-05-07T20:32:25.1477936Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1478250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1478552Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1478884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1479233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1479512Z ) 2025-05-07T20:32:25.1479883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1480366Z def test_silu_mul_quant( 2025-05-07T20:32:25.1480829Z self, 2025-05-07T20:32:25.1481032Z T: int, 2025-05-07T20:32:25.1481235Z D: int, 2025-05-07T20:32:25.1481458Z scale_ub: Optional[float], 2025-05-07T20:32:25.1481727Z contiguous: bool, 2025-05-07T20:32:25.1481972Z compiled: bool, 2025-05-07T20:32:25.1482202Z ) -> None: 2025-05-07T20:32:25.1482416Z torch.manual_seed(2025) 2025-05-07T20:32:25.1482660Z 2025-05-07T20:32:25.1482935Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1485040Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
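Since the free-memory figure keeps shrinking across generated examples, tensors from earlier examples are apparently still referenced or cached between runs. A minimal mitigation sketch; where to call it (for instance a setUp/tearDown hook) is an assumption, not something this log shows:

    import gc

    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.synchronize()  # wait for in-flight kernels to finish
        torch.cuda.empty_cache()  # hand cached blocks back to the driver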
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1487028Z 2025-05-07T20:32:25.1487149Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.1487372Z 2025-05-07T20:32:25.1496661Z FAILED 2025-05-07T20:32:25.1496791Z 2025-05-07T20:32:25.1496938Z =================================== FAILURES =================================== 2025-05-07T20:32:25.1497542Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:25.1498155Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:25.1499016Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:25.1499836Z | yield 2025-05-07T20:32:25.1500435Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:25.1501264Z | self._callTestMethod(testMethod) 2025-05-07T20:32:25.1502054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:25.1502828Z | if method() is not None: 2025-05-07T20:32:25.1503178Z | ^^^^^^^^ 2025-05-07T20:32:25.1504066Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:25.1505066Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1505478Z | ^^^^^^^ 2025-05-07T20:32:25.1506256Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:25.1517103Z | raise the_error_hypothesis_found 2025-05-07T20:32:25.1517752Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:25.1518355Z +-+---------------- 1 ---------------- 2025-05-07T20:32:25.1518779Z | Traceback (most recent call last): 2025-05-07T20:32:25.1519788Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1520908Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1521437Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1524219Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1527086Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1527694Z | self=, 2025-05-07T20:32:25.1528500Z | T=2048, 2025-05-07T20:32:25.1528829Z | D=5120, # or any other generated value 2025-05-07T20:32:25.1529312Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.1529863Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.1530380Z | compiled=False, # or any other generated value 2025-05-07T20:32:25.1530808Z | ) 2025-05-07T20:32:25.1531050Z | 2025-05-07T20:32:25.1531795Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:25.1532860Z +---------------- 2 ---------------- 2025-05-07T20:32:25.1533259Z | Traceback (most recent call last): 2025-05-07T20:32:25.1534346Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1535449Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1535985Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1538780Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1541624Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1542227Z | self=, 2025-05-07T20:32:25.1542789Z | T=128, 2025-05-07T20:32:25.1543081Z | D=7168, 2025-05-07T20:32:25.1543298Z | scale_ub=None, 2025-05-07T20:32:25.1543549Z | contiguous=True, 2025-05-07T20:32:25.1543796Z | compiled=True, 2025-05-07T20:32:25.1544023Z | ) 2025-05-07T20:32:25.1544211Z | 2025-05-07T20:32:25.1544739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1545352Z +---------------- 3 ---------------- 2025-05-07T20:32:25.1545647Z | Traceback (most recent call last): 2025-05-07T20:32:25.1546362Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:25.1547143Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1547518Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1549584Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
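Each sub-failure above ends with a ready-made replay decorator. To rerun one falsifying example deterministically, stack @reproduce_failure above @given with the blob copied verbatim from the log; the version string must match the installed Hypothesis (6.131.14 here). A sketch using the blob from failure 1; the strategy and body are abbreviated:

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from failure 1 above
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_replay(T: int) -> None:
        ...  # real test body goes here; Hypothesis replays exactly the recorded example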
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.1551542Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1551985Z | self=, 2025-05-07T20:32:25.1552395Z | T=128, 2025-05-07T20:32:25.1552599Z | D=5120, 2025-05-07T20:32:25.1552954Z | scale_ub=1200.0, 2025-05-07T20:32:25.1553225Z | contiguous=True, 2025-05-07T20:32:25.1553482Z | compiled=True, 2025-05-07T20:32:25.1553727Z | ) 2025-05-07T20:32:25.1553921Z | 2025-05-07T20:32:25.1554537Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1555262Z +---------------- 4 ---------------- 2025-05-07T20:32:25.1555585Z | Traceback (most recent call last): 2025-05-07T20:32:25.1556437Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:25.1557358Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.1557717Z | ^^^^^^^^ 2025-05-07T20:32:25.1558519Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:25.1559354Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1559791Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1560757Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:25.1561729Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1562448Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:25.1563329Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1563852Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1564616Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:25.1565544Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1566107Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1566908Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:25.1567883Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1568425Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1569186Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:25.1570018Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1570497Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1571202Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:25.1571875Z | fn() 2025-05-07T20:32:25.1572543Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:25.1573300Z | self.fn.run( 2025-05-07T20:32:25.1573921Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:25.1574607Z | kernel = self.compile( 2025-05-07T20:32:25.1574894Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:25.1575604Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:25.1576451Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1576939Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1577702Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.1578654Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1579207Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.1579635Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1580032Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1580324Z | ^ 2025-05-07T20:32:25.1580901Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1581614Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:25.1582106Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:25.1582705Z | self=, 2025-05-07T20:32:25.1583208Z | T=1, # or any other generated value 2025-05-07T20:32:25.1583558Z | D=5120, # or any other generated value 2025-05-07T20:32:25.1583934Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:25.1584339Z | contiguous=True, # or any other generated value 2025-05-07T20:32:25.1584751Z | compiled=True, # or any other generated value 2025-05-07T20:32:25.1585089Z | ) 2025-05-07T20:32:25.1585275Z | 2025-05-07T20:32:25.1585896Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:25.1586625Z +------------------------------------ 2025-05-07T20:32:25.1587025Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:25.1587455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1587937Z self=, 2025-05-07T20:32:25.1588502Z T=1, 2025-05-07T20:32:25.1588757Z D=5120, 2025-05-07T20:32:25.1589032Z scale_ub=None, 2025-05-07T20:32:25.1589442Z contiguous=True, 2025-05-07T20:32:25.1589756Z compiled=True, 2025-05-07T20:32:25.1590098Z ) 2025-05-07T20:32:25.1590547Z self = 2025-05-07T20:32:25.1591222Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1591597Z 2025-05-07T20:32:25.1719740Z @given( 2025-05-07T20:32:25.1720116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1720587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1720993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1721435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1721905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1722293Z ) 2025-05-07T20:32:25.1722772Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1723372Z def test_silu_mul_quant( 2025-05-07T20:32:25.1723698Z self, 2025-05-07T20:32:25.1723968Z T: int, 2025-05-07T20:32:25.1724237Z D: int, 2025-05-07T20:32:25.1724533Z scale_ub: Optional[float], 2025-05-07T20:32:25.1724910Z contiguous: 
bool, 2025-05-07T20:32:25.1725239Z compiled: bool, 2025-05-07T20:32:25.1725549Z ) -> None: 2025-05-07T20:32:25.1725843Z torch.manual_seed(2025) 2025-05-07T20:32:25.1726174Z 2025-05-07T20:32:25.1726544Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1727029Z 2025-05-07T20:32:25.1727302Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1727706Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1728400Z x = x_sign * x_clamp 2025-05-07T20:32:25.1729156Z x0 = x[:, :D] 2025-05-07T20:32:25.1729487Z x1 = x[:, D:] 2025-05-07T20:32:25.1729779Z 2025-05-07T20:32:25.1730033Z if contiguous: 2025-05-07T20:32:25.1730340Z x0 = x0.contiguous() 2025-05-07T20:32:25.1730674Z x1 = x1.contiguous() 2025-05-07T20:32:25.1730976Z 2025-05-07T20:32:25.1731227Z if scale_ub is not None: 2025-05-07T20:32:25.1731584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1731915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1732223Z ) 2025-05-07T20:32:25.1732417Z else: 2025-05-07T20:32:25.1732621Z scale_ub_tensor = None 2025-05-07T20:32:25.1733073Z 2025-05-07T20:32:25.1733519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1733970Z op = silu_mul_quant 2025-05-07T20:32:25.1734330Z if compiled: 2025-05-07T20:32:25.1734679Z op = torch.compile(op) 2025-05-07T20:32:25.1735156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1735501Z 2025-05-07T20:32:25.1735742Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1736133Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1736438Z 2025-05-07T20:32:25.1736679Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1737010Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1737297Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1737607Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1737967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1738270Z 2025-05-07T20:32:25.1738479Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.1738676Z 2025-05-07T20:32:25.1738784Z moe/activation_test.py:126: 2025-05-07T20:32:25.1739077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1739420Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1739798Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1740591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1741351Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1741899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1742576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1743257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1743972Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1744720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1745469Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1746191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1746826Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1747421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1747935Z fn() 2025-05-07T20:32:25.1748432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1749013Z self.fn.run( 2025-05-07T20:32:25.1749619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1750169Z kernel = self.compile( 2025-05-07T20:32:25.1750763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1751419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1751818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1752048Z 2025-05-07T20:32:25.1752254Z self = 2025-05-07T20:32:25.1753331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1754712Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd95239260>} 2025-05-07T20:32:25.1756165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1757187Z context = 2025-05-07T20:32:25.1757472Z 2025-05-07T20:32:25.1757639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1758161Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1758626Z module_map=module_map) 2025-05-07T20:32:25.1758986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1759342Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1759609Z E ^ 2025-05-07T20:32:25.1760078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1760524Z 2025-05-07T20:32:25.1760940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1761454Z 2025-05-07T20:32:25.1761558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1761967Z self=, 2025-05-07T20:32:25.1762368Z T=2048, 2025-05-07T20:32:25.1762553Z D=5120, 2025-05-07T20:32:25.1762747Z scale_ub=1200.0, 2025-05-07T20:32:25.1762969Z contiguous=True, 2025-05-07T20:32:25.1763188Z compiled=False, 2025-05-07T20:32:25.1763400Z ) 2025-05-07T20:32:25.1763721Z self = 2025-05-07T20:32:25.1764203Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1764480Z 2025-05-07T20:32:25.1764557Z @given( 2025-05-07T20:32:25.1764795Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1765100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1765403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1765735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1766058Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1766336Z ) 2025-05-07T20:32:25.1766683Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1767122Z def test_silu_mul_quant( 2025-05-07T20:32:25.1767356Z self, 2025-05-07T20:32:25.1767554Z T: int, 2025-05-07T20:32:25.1767752Z D: int, 2025-05-07T20:32:25.1767964Z scale_ub: Optional[float], 2025-05-07T20:32:25.1768232Z contiguous: bool, 2025-05-07T20:32:25.1768472Z compiled: bool, 2025-05-07T20:32:25.1768687Z ) -> None: 2025-05-07T20:32:25.1768906Z torch.manual_seed(2025) 2025-05-07T20:32:25.1769150Z 2025-05-07T20:32:25.1769420Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1769809Z 2025-05-07T20:32:25.1770009Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1770349Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1770662Z x = x_sign * x_clamp 2025-05-07T20:32:25.1770905Z x0 = x[:, :D] 2025-05-07T20:32:25.1771128Z x1 = x[:, D:] 2025-05-07T20:32:25.1771330Z 2025-05-07T20:32:25.1771517Z if contiguous: 2025-05-07T20:32:25.1771748Z x0 = x0.contiguous() 2025-05-07T20:32:25.1772000Z x1 = x1.contiguous() 2025-05-07T20:32:25.1772240Z 2025-05-07T20:32:25.1772431Z if scale_ub is not None: 2025-05-07T20:32:25.1772697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1773032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1773339Z ) 2025-05-07T20:32:25.1773578Z else: 2025-05-07T20:32:25.1773838Z scale_ub_tensor = None 2025-05-07T20:32:25.1774089Z 2025-05-07T20:32:25.1774317Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1774632Z op = silu_mul_quant 2025-05-07T20:32:25.1774921Z if compiled: 2025-05-07T20:32:25.1775163Z op = torch.compile(op) 2025-05-07T20:32:25.1775459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1775733Z 2025-05-07T20:32:25.1775927Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1776092Z 2025-05-07T20:32:25.1776193Z moe/activation_test.py:117: 2025-05-07T20:32:25.1776485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1776818Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1777093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1777774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1778470Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.1778997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.1779678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.1780355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.1780920Z kernel = self.compile(
2025-05-07T20:32:25.1781450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.1782099Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.1782495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1782719Z
2025-05-07T20:32:25.1782931Z self =
2025-05-07T20:32:25.1784005Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.1785370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd94ee4180>}
2025-05-07T20:32:25.1786712Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.1787731Z context =
2025-05-07T20:32:25.1788016Z
2025-05-07T20:32:25.1788186Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.1788694Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.1789247Z module_map=module_map)
2025-05-07T20:32:25.1789615Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1789968Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.1790279Z E ^
2025-05-07T20:32:25.1790749Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1791194Z
2025-05-07T20:32:25.1791620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
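This CompilationError is Triton's capability check rather than a bug in the kernel itself: fp8e4nv (FP8 E4M3) code generation requires an NVIDIA GPU with compute capability 8.9 or newer (Ada/Hopper), while older architectures only expose fp8e4b15 and fp8e5, exactly as the message says. A guard along the following lines (a sketch with a hypothetical helper and class name, not code from the FBGEMM test suite) would skip these cases on unsupported hardware instead of failing:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Triton only compiles fp8e4nv (E4M3) kernels on SM 8.9+ (Ada / Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Hypothetical usage: gate the whole test class on the capability check.
@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires an SM 8.9+ GPU")
class SiluMulQuantTests(unittest.TestCase):
    ...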
2025-05-07T20:32:25.1792128Z
2025-05-07T20:32:25.1792231Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.1792643Z self=,
2025-05-07T20:32:25.1793045Z T=2048,
2025-05-07T20:32:25.1793232Z D=5120,
2025-05-07T20:32:25.1793429Z scale_ub=1200.0,
2025-05-07T20:32:25.1793696Z contiguous=True,
2025-05-07T20:32:25.1793978Z compiled=True,
2025-05-07T20:32:25.1794184Z )
2025-05-07T20:32:25.1794506Z self =
2025-05-07T20:32:25.1795026Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.1795304Z
2025-05-07T20:32:25.1795383Z @given(
2025-05-07T20:32:25.1795615Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.1795928Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.1796226Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.1796553Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.1796883Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.1797159Z )
2025-05-07T20:32:25.1797415Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.1797509Z def test_silu_mul_quant(
2025-05-07T20:32:25.1797594Z self,
2025-05-07T20:32:25.1797685Z T: int,
2025-05-07T20:32:25.1797761Z D: int,
2025-05-07T20:32:25.1797864Z scale_ub: Optional[float],
2025-05-07T20:32:25.1797953Z contiguous: bool,
2025-05-07T20:32:25.1798041Z compiled: bool,
2025-05-07T20:32:25.1798127Z ) -> None:
2025-05-07T20:32:25.1798228Z torch.manual_seed(2025)
2025-05-07T20:32:25.1798300Z
2025-05-07T20:32:25.1798474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.1798549Z
2025-05-07T20:32:25.1798641Z x_sign = torch.sign(x)
2025-05-07T20:32:25.1798772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.1798861Z x = x_sign * x_clamp
2025-05-07T20:32:25.1798941Z x0 = x[:, :D]
2025-05-07T20:32:25.1799031Z x1 = x[:, D:]
2025-05-07T20:32:25.1799103Z
2025-05-07T20:32:25.1799187Z if contiguous:
2025-05-07T20:32:25.1799284Z x0 = x0.contiguous()
2025-05-07T20:32:25.1799376Z x1 = x1.contiguous()
2025-05-07T20:32:25.1799456Z
2025-05-07T20:32:25.1799545Z if scale_ub is not None:
2025-05-07T20:32:25.1799650Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.1800042Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.1800132Z )
2025-05-07T20:32:25.1800223Z else:
2025-05-07T20:32:25.1800326Z scale_ub_tensor = None
2025-05-07T20:32:25.1800399Z
2025-05-07T20:32:25.1800531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.1800627Z op = silu_mul_quant
2025-05-07T20:32:25.1800712Z if compiled:
2025-05-07T20:32:25.1800811Z op = torch.compile(op)
2025-05-07T20:32:25.1800920Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.1800992Z
2025-05-07T20:32:25.1801087Z y_fp8, y_scale = fn()
2025-05-07T20:32:25.1801208Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:25.1801282Z
2025-05-07T20:32:25.1801424Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.1801525Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:25.1801625Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:25.1801806Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:25.1801946Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.1802020Z
2025-05-07T20:32:25.1802126Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.1802131Z
2025-05-07T20:32:25.1802229Z moe/activation_test.py:126:
2025-05-07T20:32:25.1802364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1802468Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:25.1802600Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:25.1803161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:25.1803341Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:25.1803700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.1803978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.1804347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:25.1804609Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.1805004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:25.1805255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:25.1805635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:25.1805807Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:25.1806150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:25.1806233Z fn()
2025-05-07T20:32:25.1806632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:25.1806723Z self.fn.run(
2025-05-07T20:32:25.1807058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.1807155Z kernel = self.compile(
2025-05-07T20:32:25.1807537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.1807710Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.1807845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.1807853Z
2025-05-07T20:32:25.1808062Z self =
2025-05-07T20:32:25.1808839Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.1809348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8fbb74c0>}
2025-05-07T20:32:25.1810121Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.1810340Z context =
2025-05-07T20:32:25.1810344Z
2025-05-07T20:32:25.1810509Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.1810782Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.1810890Z module_map=module_map)
2025-05-07T20:32:25.1811102Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1824563Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.1824668Z E ^
2025-05-07T20:32:25.1825044Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1825050Z
2025-05-07T20:32:25.1825470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
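The ref_fn path above computes the SiLU-gated product in fp32 and then quantizes it row-wise to FP8. As a plain-PyTorch sketch of what triton_quantize_fp8_row is being asked to produce (assuming per-row scale = max(|row|) / FP8_MAX, optionally capped by scale_ub; the real Triton kernel in fp8_gemm.py may differ in details such as epsilon handling):

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub=None):
    # Per-row scale so the largest magnitude in each row maps to the FP8 max.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], as the test does, then recovers y up to FP8 rounding.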
2025-05-07T20:32:25.1825474Z
2025-05-07T20:32:25.1825577Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.1825800Z self=,
2025-05-07T20:32:25.1825963Z T=16384,
2025-05-07T20:32:25.1826084Z D=7168,
2025-05-07T20:32:25.1826169Z scale_ub=1200.0,
2025-05-07T20:32:25.1826253Z contiguous=False,
2025-05-07T20:32:25.1826336Z compiled=False,
2025-05-07T20:32:25.1826406Z )
2025-05-07T20:32:25.1837834Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.1837934Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.1838015Z E ^
2025-05-07T20:32:25.1838366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.1838789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
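Each "Trying example" block is one Hypothesis draw from the sampled_from grids in the @given decorator, and every draw fails with the same fp8e4nv CompilationError, differing only in whether it surfaces in _fbgemm_silu_mul_quant or, on the reference path, _kernel_quantize_fp8_row. To replay a single draw deterministically while debugging, Hypothesis's @example decorator can pin a case (a hypothetical snippet mirroring the decorators printed above; max_examples=10 stands in for the suite's _MAX_SAMPLES):

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
@settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...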
2025-05-07T20:32:25.1831345Z op = torch.compile(op) 2025-05-07T20:32:25.1831449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1831519Z 2025-05-07T20:32:25.1831609Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1831616Z 2025-05-07T20:32:25.1831714Z moe/activation_test.py:117: 2025-05-07T20:32:25.1831839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1831943Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1832192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1832695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1832794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1833147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1833366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1833705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1833796Z kernel = self.compile( 2025-05-07T20:32:25.1834255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1834487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1834682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1834687Z 2025-05-07T20:32:25.1834900Z self = 2025-05-07T20:32:25.1835678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1836188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8fe971a0>} 2025-05-07T20:32:25.1836930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1837130Z context = 2025-05-07T20:32:25.1837135Z 2025-05-07T20:32:25.1837306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1837562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1837673Z module_map=module_map) 2025-05-07T20:32:25.1837834Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1837934Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1838015Z E ^ 2025-05-07T20:32:25.1838366Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1838371Z 2025-05-07T20:32:25.1838789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1838796Z 2025-05-07T20:32:25.1838901Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1839122Z self=, 2025-05-07T20:32:25.1839206Z T=1, 2025-05-07T20:32:25.1839282Z D=7168, 2025-05-07T20:32:25.1839362Z scale_ub=None, 2025-05-07T20:32:25.1839450Z contiguous=True, 2025-05-07T20:32:25.1839531Z compiled=True, 2025-05-07T20:32:25.1839606Z ) 2025-05-07T20:32:25.1839826Z self = 2025-05-07T20:32:25.1839985Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1839989Z 2025-05-07T20:32:25.1840077Z @given( 2025-05-07T20:32:25.1840202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1840302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1840453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1840584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1840707Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1840786Z ) 2025-05-07T20:32:25.1841077Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1841173Z def test_silu_mul_quant( 2025-05-07T20:32:25.1841250Z self, 2025-05-07T20:32:25.1841324Z T: int, 2025-05-07T20:32:25.1841404Z D: int, 2025-05-07T20:32:25.1841503Z scale_ub: Optional[float], 2025-05-07T20:32:25.1841589Z contiguous: bool, 2025-05-07T20:32:25.1841678Z compiled: bool, 2025-05-07T20:32:25.1841758Z ) -> None: 2025-05-07T20:32:25.1841853Z torch.manual_seed(2025) 2025-05-07T20:32:25.1841928Z 2025-05-07T20:32:25.1842098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1842170Z 2025-05-07T20:32:25.1842330Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1842496Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1842583Z x = x_sign * x_clamp 2025-05-07T20:32:25.1842663Z x0 = x[:, :D] 2025-05-07T20:32:25.1842777Z x1 = x[:, D:] 2025-05-07T20:32:25.1842851Z 2025-05-07T20:32:25.1842944Z if contiguous: 2025-05-07T20:32:25.1843034Z x0 = x0.contiguous() 2025-05-07T20:32:25.1843124Z x1 = x1.contiguous() 2025-05-07T20:32:25.1843195Z 2025-05-07T20:32:25.1843283Z if scale_ub is not None: 2025-05-07T20:32:25.1843394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1843525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1843602Z ) 2025-05-07T20:32:25.1843678Z else: 2025-05-07T20:32:25.1843769Z scale_ub_tensor = None 2025-05-07T20:32:25.1843838Z 2025-05-07T20:32:25.1843966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1844057Z op = silu_mul_quant 2025-05-07T20:32:25.1844142Z if compiled: 2025-05-07T20:32:25.1844240Z op = torch.compile(op) 2025-05-07T20:32:25.1844342Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1844419Z 2025-05-07T20:32:25.1844509Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1844629Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1844702Z 2025-05-07T20:32:25.1844833Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1844932Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1845032Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1845150Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1845287Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1845359Z 2025-05-07T20:32:25.1845455Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1845462Z 2025-05-07T20:32:25.1845562Z moe/activation_test.py:126: 2025-05-07T20:32:25.1845694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1845798Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1845936Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1846498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1846597Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1847616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1847837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1848210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1848465Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1848866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1849175Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1849568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1849763Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1850112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1850190Z fn() 2025-05-07T20:32:25.1850595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1850680Z self.fn.run( 2025-05-07T20:32:25.1851015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1851192Z kernel = self.compile( 2025-05-07T20:32:25.1851570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1851791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1851922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1851926Z 2025-05-07T20:32:25.1852129Z self = 2025-05-07T20:32:25.1852909Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1853411Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8fc128e0>} 2025-05-07T20:32:25.1854168Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1854362Z context = 2025-05-07T20:32:25.1854366Z 2025-05-07T20:32:25.1854538Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1854799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1854910Z module_map=module_map) 2025-05-07T20:32:25.1855079Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1855186Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1855268Z E ^ 2025-05-07T20:32:25.1855629Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1855639Z 2025-05-07T20:32:25.1856052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1856057Z 2025-05-07T20:32:25.1856173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1856401Z self=, 2025-05-07T20:32:25.1856480Z T=4096, 2025-05-07T20:32:25.1856567Z D=5120, 2025-05-07T20:32:25.1856654Z scale_ub=None, 2025-05-07T20:32:25.1856742Z contiguous=False, 2025-05-07T20:32:25.1856838Z compiled=False, 2025-05-07T20:32:25.1856913Z ) 2025-05-07T20:32:25.1857131Z self = 2025-05-07T20:32:25.1857316Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1857320Z 2025-05-07T20:32:25.1857399Z @given( 2025-05-07T20:32:25.1857526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1857633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1857749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1857874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1858035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1858112Z ) 2025-05-07T20:32:25.1858368Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1858463Z def test_silu_mul_quant( 2025-05-07T20:32:25.1858551Z self, 2025-05-07T20:32:25.1858629Z T: int, 2025-05-07T20:32:25.1858707Z D: int, 2025-05-07T20:32:25.1858813Z scale_ub: Optional[float], 2025-05-07T20:32:25.1858904Z contiguous: bool, 2025-05-07T20:32:25.1858992Z compiled: bool, 2025-05-07T20:32:25.1859082Z ) -> None: 2025-05-07T20:32:25.1859178Z torch.manual_seed(2025) 2025-05-07T20:32:25.1859252Z 2025-05-07T20:32:25.1859472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1859592Z 2025-05-07T20:32:25.1859686Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1859820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1859949Z x = x_sign * x_clamp 2025-05-07T20:32:25.1860050Z x0 = x[:, :D] 2025-05-07T20:32:25.1860154Z x1 = x[:, D:] 2025-05-07T20:32:25.1860241Z 2025-05-07T20:32:25.1860347Z if contiguous: 2025-05-07T20:32:25.1860441Z x0 = x0.contiguous() 2025-05-07T20:32:25.1860531Z x1 = x1.contiguous() 2025-05-07T20:32:25.1860609Z 2025-05-07T20:32:25.1860701Z if scale_ub is not None: 2025-05-07T20:32:25.1860807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1860951Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1861024Z ) 2025-05-07T20:32:25.1861109Z else: 2025-05-07T20:32:25.1861202Z scale_ub_tensor = None 2025-05-07T20:32:25.1861278Z 2025-05-07T20:32:25.1861416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1861506Z op = silu_mul_quant 2025-05-07T20:32:25.1861595Z if compiled: 
2025-05-07T20:32:25.1861704Z op = torch.compile(op) 2025-05-07T20:32:25.1861814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1861887Z 2025-05-07T20:32:25.1861987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1861991Z 2025-05-07T20:32:25.1862087Z moe/activation_test.py:117: 2025-05-07T20:32:25.1862227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1862328Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1862427Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1862931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1863027Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1863387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1863613Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1863954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1864057Z kernel = self.compile( 2025-05-07T20:32:25.1864437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1864611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1864747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1864752Z 2025-05-07T20:32:25.1864955Z self = 2025-05-07T20:32:25.1865735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1866284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f59c680>} 2025-05-07T20:32:25.1867032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1867228Z context = 2025-05-07T20:32:25.1867232Z 2025-05-07T20:32:25.1867393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1867657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1867806Z module_map=module_map) 2025-05-07T20:32:25.1868006Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1868112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1868190Z E ^ 2025-05-07T20:32:25.1868583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1868594Z 2025-05-07T20:32:25.1869007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1869012Z 2025-05-07T20:32:25.1869173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1869400Z self=, 2025-05-07T20:32:25.1869477Z T=4096, 2025-05-07T20:32:25.1869553Z D=7168, 2025-05-07T20:32:25.1869638Z scale_ub=None, 2025-05-07T20:32:25.1869726Z contiguous=False, 2025-05-07T20:32:25.1869813Z compiled=False, 2025-05-07T20:32:25.1869895Z ) 2025-05-07T20:32:25.1870114Z self = 2025-05-07T20:32:25.1870296Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1870300Z 2025-05-07T20:32:25.1870384Z @given( 2025-05-07T20:32:25.1870530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1870650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1870773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1870888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1871005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1871080Z ) 2025-05-07T20:32:25.1871329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1871423Z def test_silu_mul_quant( 2025-05-07T20:32:25.1871499Z self, 2025-05-07T20:32:25.1871581Z T: int, 2025-05-07T20:32:25.1871656Z D: int, 2025-05-07T20:32:25.1871758Z scale_ub: Optional[float], 2025-05-07T20:32:25.1871860Z contiguous: bool, 2025-05-07T20:32:25.1871951Z compiled: bool, 2025-05-07T20:32:25.1872030Z ) -> None: 2025-05-07T20:32:25.1872129Z torch.manual_seed(2025) 2025-05-07T20:32:25.1872204Z 2025-05-07T20:32:25.1872374Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1872453Z 2025-05-07T20:32:25.1872544Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1872668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1872764Z x = x_sign * x_clamp 2025-05-07T20:32:25.1872850Z x0 = x[:, :D] 2025-05-07T20:32:25.1872936Z x1 = x[:, D:] 2025-05-07T20:32:25.1873008Z 2025-05-07T20:32:25.1873093Z if contiguous: 2025-05-07T20:32:25.1873189Z x0 = x0.contiguous() 2025-05-07T20:32:25.1873277Z x1 = x1.contiguous() 2025-05-07T20:32:25.1873350Z 2025-05-07T20:32:25.1873447Z if scale_ub is not None: 2025-05-07T20:32:25.1873558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1873691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1873777Z ) 2025-05-07T20:32:25.1873855Z else: 2025-05-07T20:32:25.1874034Z scale_ub_tensor = None 2025-05-07T20:32:25.1874113Z 2025-05-07T20:32:25.1874243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1874342Z op = silu_mul_quant 2025-05-07T20:32:25.1874431Z if compiled: 2025-05-07T20:32:25.1874530Z op = torch.compile(op) 2025-05-07T20:32:25.1874642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1874716Z 2025-05-07T20:32:25.1874806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1874810Z 2025-05-07T20:32:25.1874914Z moe/activation_test.py:117: 2025-05-07T20:32:25.1875042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1875185Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1875330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1875866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1875974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1876331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1876553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1876896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1876990Z kernel = self.compile( 2025-05-07T20:32:25.1877371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1877551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1877684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1877689Z 2025-05-07T20:32:25.1877899Z self = 2025-05-07T20:32:25.1878674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1879187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8feec180>} 2025-05-07T20:32:25.1879927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1880150Z context = 2025-05-07T20:32:25.1880159Z 2025-05-07T20:32:25.1880348Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1880606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1880721Z module_map=module_map) 2025-05-07T20:32:25.1880884Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1880985Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1881070Z E ^ 2025-05-07T20:32:25.1881424Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1881429Z 2025-05-07T20:32:25.1881843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1881853Z 2025-05-07T20:32:25.1881961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1882185Z self=, 2025-05-07T20:32:25.1882271Z T=128, 2025-05-07T20:32:25.1882349Z D=7168, 2025-05-07T20:32:25.1882430Z scale_ub=None, 2025-05-07T20:32:25.1882521Z contiguous=False, 2025-05-07T20:32:25.1882649Z compiled=True, 2025-05-07T20:32:25.1882722Z ) 2025-05-07T20:32:25.1882944Z self = 2025-05-07T20:32:25.1883113Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.1883118Z 2025-05-07T20:32:25.1883200Z @given( 2025-05-07T20:32:25.1883317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1883417Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1883535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1883655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1883766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1883886Z ) 2025-05-07T20:32:25.1884169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1884275Z def test_silu_mul_quant( 2025-05-07T20:32:25.1884352Z self, 2025-05-07T20:32:25.1884472Z T: int, 2025-05-07T20:32:25.1884559Z D: int, 2025-05-07T20:32:25.1884661Z scale_ub: Optional[float], 2025-05-07T20:32:25.1884752Z contiguous: bool, 2025-05-07T20:32:25.1884846Z compiled: bool, 2025-05-07T20:32:25.1884925Z ) -> None: 2025-05-07T20:32:25.1885024Z torch.manual_seed(2025) 2025-05-07T20:32:25.1885098Z 2025-05-07T20:32:25.1885265Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1885345Z 2025-05-07T20:32:25.1885441Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1885566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1885662Z x = x_sign * x_clamp 2025-05-07T20:32:25.1885745Z x0 = x[:, :D] 2025-05-07T20:32:25.1885828Z x1 = x[:, D:] 2025-05-07T20:32:25.1885907Z 2025-05-07T20:32:25.1885991Z if contiguous: 2025-05-07T20:32:25.1886084Z x0 = x0.contiguous() 2025-05-07T20:32:25.1886179Z x1 = x1.contiguous() 2025-05-07T20:32:25.1886255Z 2025-05-07T20:32:25.1886351Z if scale_ub is not None: 2025-05-07T20:32:25.1886464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1886598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1886680Z ) 2025-05-07T20:32:25.1886760Z else: 2025-05-07T20:32:25.1886856Z scale_ub_tensor = None 2025-05-07T20:32:25.1886935Z 2025-05-07T20:32:25.1887061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1887151Z op = silu_mul_quant 2025-05-07T20:32:25.1887243Z if compiled: 2025-05-07T20:32:25.1887343Z op = torch.compile(op) 2025-05-07T20:32:25.1887452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1887533Z 2025-05-07T20:32:25.1887624Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1887742Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1887821Z 2025-05-07T20:32:25.1887961Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1888067Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1888166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1888287Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1888430Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1888503Z 2025-05-07T20:32:25.1888602Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1888607Z 2025-05-07T20:32:25.1888710Z moe/activation_test.py:126: 2025-05-07T20:32:25.1888841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1888952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1889088Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1889647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1889809Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1890204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1890445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1890816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1891067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1891466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1891758Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1892170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1892381Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1892722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1892806Z fn() 2025-05-07T20:32:25.1893206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1893290Z self.fn.run( 2025-05-07T20:32:25.1893630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1893722Z kernel = self.compile( 2025-05-07T20:32:25.1894099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1894287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1894414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1894418Z 2025-05-07T20:32:25.1894633Z self = 2025-05-07T20:32:25.1895406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1896128Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8f5c7100>} 2025-05-07T20:32:25.1896874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1897075Z context = 2025-05-07T20:32:25.1897080Z 2025-05-07T20:32:25.1897251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1897513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1897619Z module_map=module_map) 2025-05-07T20:32:25.1897789Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1897890Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1897971Z E ^ 2025-05-07T20:32:25.1898324Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1898328Z 2025-05-07T20:32:25.1898739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1898747Z 2025-05-07T20:32:25.1898857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1899076Z self=, 2025-05-07T20:32:25.1899162Z T=128, 2025-05-07T20:32:25.1899242Z D=7168, 2025-05-07T20:32:25.1899375Z scale_ub=None, 2025-05-07T20:32:25.1899471Z contiguous=False, 2025-05-07T20:32:25.1899554Z compiled=False, 2025-05-07T20:32:25.1899626Z ) 2025-05-07T20:32:25.1899851Z self = 2025-05-07T20:32:25.1900021Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.1900025Z 2025-05-07T20:32:25.1900102Z @given( 2025-05-07T20:32:25.1900229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1900328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1900473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1900655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1900807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1900888Z ) 2025-05-07T20:32:25.1901131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1901290Z def test_silu_mul_quant( 2025-05-07T20:32:25.1901373Z self, 2025-05-07T20:32:25.1901452Z T: int, 2025-05-07T20:32:25.1901528Z D: int, 2025-05-07T20:32:25.1901633Z scale_ub: Optional[float], 2025-05-07T20:32:25.1901723Z contiguous: bool, 2025-05-07T20:32:25.1901808Z compiled: bool, 2025-05-07T20:32:25.1901896Z ) -> None: 2025-05-07T20:32:25.1901992Z torch.manual_seed(2025) 2025-05-07T20:32:25.1902070Z 2025-05-07T20:32:25.1902242Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1902317Z 2025-05-07T20:32:25.1902416Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1902540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1902634Z x = x_sign * x_clamp 2025-05-07T20:32:25.1902722Z x0 = x[:, :D] 2025-05-07T20:32:25.1902802Z x1 = x[:, D:] 2025-05-07T20:32:25.1902878Z 2025-05-07T20:32:25.1902969Z if contiguous: 2025-05-07T20:32:25.1903071Z x0 = x0.contiguous() 2025-05-07T20:32:25.1903162Z x1 = x1.contiguous() 2025-05-07T20:32:25.1903243Z 2025-05-07T20:32:25.1903333Z if scale_ub is not None: 2025-05-07T20:32:25.1903438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1903578Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1903654Z ) 2025-05-07T20:32:25.1903736Z else: 2025-05-07T20:32:25.1903833Z scale_ub_tensor = None 2025-05-07T20:32:25.1903905Z 2025-05-07T20:32:25.1904041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1904131Z op = silu_mul_quant 2025-05-07T20:32:25.1904219Z if compiled: 
2025-05-07T20:32:25.1904330Z op = torch.compile(op) 2025-05-07T20:32:25.1904435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1904508Z 2025-05-07T20:32:25.1904605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1904612Z 2025-05-07T20:32:25.1904709Z moe/activation_test.py:117: 2025-05-07T20:32:25.1904843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1904944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1905042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1905542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1905638Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1905994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1906222Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1906564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1906667Z kernel = self.compile( 2025-05-07T20:32:25.1907100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1907275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1907408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1907413Z 2025-05-07T20:32:25.1907615Z self = 2025-05-07T20:32:25.1908392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1908929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f08ce00>} 2025-05-07T20:32:25.1909890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1910091Z context = 2025-05-07T20:32:25.1910095Z 2025-05-07T20:32:25.1910258Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1910524Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1910630Z module_map=module_map) 2025-05-07T20:32:25.1910792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1910896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1910977Z E ^ 2025-05-07T20:32:25.1911332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1911342Z 2025-05-07T20:32:25.1911758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1911762Z 2025-05-07T20:32:25.1911865Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1912094Z self=, 2025-05-07T20:32:25.1912170Z T=4096, 2025-05-07T20:32:25.1912247Z D=5120, 2025-05-07T20:32:25.1912336Z scale_ub=1200.0, 2025-05-07T20:32:25.1912422Z contiguous=True, 2025-05-07T20:32:25.1912506Z compiled=False, 2025-05-07T20:32:25.1912584Z ) 2025-05-07T20:32:25.1912800Z self = 2025-05-07T20:32:25.1912982Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.1912988Z 2025-05-07T20:32:25.1913068Z @given( 2025-05-07T20:32:25.1913188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1913293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1913408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1913526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1913643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1913718Z ) 2025-05-07T20:32:25.1913964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1914063Z def test_silu_mul_quant( 2025-05-07T20:32:25.1914140Z self, 2025-05-07T20:32:25.1914224Z T: int, 2025-05-07T20:32:25.1914302Z D: int, 2025-05-07T20:32:25.1914399Z scale_ub: Optional[float], 2025-05-07T20:32:25.1914494Z contiguous: bool, 2025-05-07T20:32:25.1914580Z compiled: bool, 2025-05-07T20:32:25.1914660Z ) -> None: 2025-05-07T20:32:25.1914763Z torch.manual_seed(2025) 2025-05-07T20:32:25.1914839Z 2025-05-07T20:32:25.1915005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1915087Z 2025-05-07T20:32:25.1915182Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1915353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1915453Z x = x_sign * x_clamp 2025-05-07T20:32:25.1915534Z x0 = x[:, :D] 2025-05-07T20:32:25.1915620Z x1 = x[:, D:] 2025-05-07T20:32:25.1915693Z 2025-05-07T20:32:25.1915780Z if contiguous: 2025-05-07T20:32:25.1915879Z x0 = x0.contiguous() 2025-05-07T20:32:25.1915968Z x1 = x1.contiguous() 2025-05-07T20:32:25.1916042Z 2025-05-07T20:32:25.1916137Z if scale_ub is not None: 2025-05-07T20:32:25.1916242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1916376Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1916505Z ) 2025-05-07T20:32:25.1916621Z else: 2025-05-07T20:32:25.1916715Z scale_ub_tensor = None 2025-05-07T20:32:25.1916793Z 2025-05-07T20:32:25.1916921Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1917058Z op = silu_mul_quant 2025-05-07T20:32:25.1917149Z if compiled: 2025-05-07T20:32:25.1917249Z op = torch.compile(op) 2025-05-07T20:32:25.1917361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1917435Z 2025-05-07T20:32:25.1917526Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.1917530Z 2025-05-07T20:32:25.1917634Z moe/activation_test.py:117: 2025-05-07T20:32:25.1917762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1917863Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.1917969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1918463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.1918574Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.1918930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1919154Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1919525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1919628Z kernel = self.compile( 2025-05-07T20:32:25.1920024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1920201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1920327Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1920331Z 2025-05-07T20:32:25.1920539Z self = 2025-05-07T20:32:25.1921322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1921833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8f08df80>} 2025-05-07T20:32:25.1922576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1922767Z context = 2025-05-07T20:32:25.1922772Z 2025-05-07T20:32:25.1922939Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1923198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1923312Z module_map=module_map) 2025-05-07T20:32:25.1923472Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1923619Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.1923706Z E ^ 2025-05-07T20:32:25.1924061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1924065Z 2025-05-07T20:32:25.1924476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1924487Z 2025-05-07T20:32:25.1924590Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.1924810Z self=, 2025-05-07T20:32:25.1924894Z T=1, 2025-05-07T20:32:25.1924971Z D=5120, 2025-05-07T20:32:25.1925093Z scale_ub=None, 2025-05-07T20:32:25.1925222Z contiguous=True, 2025-05-07T20:32:25.1925304Z compiled=True, 2025-05-07T20:32:25.1925377Z ) 2025-05-07T20:32:25.1925598Z self = 2025-05-07T20:32:25.1925797Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.1925802Z 2025-05-07T20:32:25.1925880Z @given( 2025-05-07T20:32:25.1926006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.1926104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.1926222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.1926339Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.1926451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.1926530Z ) 2025-05-07T20:32:25.1926774Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.1926866Z def test_silu_mul_quant( 2025-05-07T20:32:25.1926952Z self, 2025-05-07T20:32:25.1927032Z T: int, 2025-05-07T20:32:25.1927110Z D: int, 2025-05-07T20:32:25.1927214Z scale_ub: Optional[float], 2025-05-07T20:32:25.1927304Z contiguous: bool, 2025-05-07T20:32:25.1927398Z compiled: bool, 2025-05-07T20:32:25.1927478Z ) -> None: 2025-05-07T20:32:25.1927573Z torch.manual_seed(2025) 2025-05-07T20:32:25.1927654Z 2025-05-07T20:32:25.1927822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.1927897Z 2025-05-07T20:32:25.1927997Z x_sign = torch.sign(x) 2025-05-07T20:32:25.1928309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.1928443Z x = x_sign * x_clamp 2025-05-07T20:32:25.1928565Z x0 = x[:, :D] 2025-05-07T20:32:25.1928678Z x1 = x[:, D:] 2025-05-07T20:32:25.1928778Z 2025-05-07T20:32:25.1928869Z if contiguous: 2025-05-07T20:32:25.1928961Z x0 = x0.contiguous() 2025-05-07T20:32:25.1929057Z x1 = x1.contiguous() 2025-05-07T20:32:25.1929142Z 2025-05-07T20:32:25.1929232Z if scale_ub is not None: 2025-05-07T20:32:25.1929344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.1929481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.1929559Z ) 2025-05-07T20:32:25.1929642Z else: 2025-05-07T20:32:25.1929737Z scale_ub_tensor = None 2025-05-07T20:32:25.1929808Z 2025-05-07T20:32:25.1929944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1930033Z op = silu_mul_quant 2025-05-07T20:32:25.1930119Z if compiled: 2025-05-07T20:32:25.1930230Z op = torch.compile(op) 2025-05-07T20:32:25.1930336Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.1930411Z 2025-05-07T20:32:25.1930509Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.1930631Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.1930727Z 2025-05-07T20:32:25.1930882Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.1931002Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.1931108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.1931382Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.1931526Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1931607Z 2025-05-07T20:32:25.1931708Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:25.1931713Z 2025-05-07T20:32:25.1931811Z moe/activation_test.py:126: 2025-05-07T20:32:25.1931947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1932054Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.1932193Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.1932753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.1933008Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.1933374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.1933657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.1934029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.1934282Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1934678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:25.1934935Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.1935308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.1935479Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.1935823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.1935904Z fn() 2025-05-07T20:32:25.1936309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.1936395Z self.fn.run( 2025-05-07T20:32:25.1936729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.1936829Z kernel = self.compile( 2025-05-07T20:32:25.1937206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.1937379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.1937516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.1937526Z 2025-05-07T20:32:25.1937730Z self = 2025-05-07T20:32:25.1938513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.1939011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7efd8f08e340>} 2025-05-07T20:32:25.1939758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.1939950Z context = 2025-05-07T20:32:25.1939955Z 2025-05-07T20:32:25.1940120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.1940388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.1940498Z module_map=module_map) 2025-05-07T20:32:25.1940715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.1940818Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.1940896Z E ^ 2025-05-07T20:32:25.1941255Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.1941259Z 2025-05-07T20:32:25.1941669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.1941673Z
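At this point the root cause is already clear, and it is environmental rather than a bug in the test logic: Triton's fp8e4nv type (the NVIDIA float8 e4m3 variant, evidently the target dtype of _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant here) is only available on GPUs of compute capability 8.9 or newer, and on older architectures Triton offers only the fp8e4b15 and fp8e5 formats named in the ValueError. A capability guard along the following lines would let the suite skip rather than fail on such hardware. This is a minimal sketch, assuming a unittest-style test class; the class name and skip message are illustrative, not FBGEMM's actual skip logic:

import unittest

import torch


def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (float8 e4m3) is only available on NVIDIA GPUs
    # with compute capability >= 8.9 (Ada/Hopper); anything older
    # raises the ValueError seen in the log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
    @unittest.skipIf(
        not gpu_supports_fp8e4nv(),
        "FP8 e4m3 (fp8e4nv) unsupported on this GPU architecture",
    )
    def test_silu_mul_quant(self) -> None:
        ...  # test body as in the log

The same predicate could instead gate the dtype choice (falling back to fp8e5/e5m2 where e4m3 is unavailable), but a skip is the smaller change and would turn the repeated CompilationErrors below into a single skipped test.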
[The next eight Hypothesis examples fail identically. (T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True), (T=128, D=5120, None, True, True), (T=4096, D=5120, None, True, True), (T=16384, D=5120, None, True, True), and (T=1, D=5120, None, contiguous=False, True) hit the error while compiling _kernel_quantize_fp8_row (via ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370); (T=1, D=5120, scale_ub=1200.0, True, True), (T=1, D=5120, None, True, compiled=False), and (T=128, D=5120, None, contiguous=False, True) hit it while compiling _fbgemm_silu_mul_quant (via silu_mul_quant, moe/activation.py:80). In every case the test body and traceback match the example above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
2025-05-07T20:32:25.2073054Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2073275Z self=, 2025-05-07T20:32:25.2073354Z T=128, 2025-05-07T20:32:25.2073477Z D=7168, 2025-05-07T20:32:25.2073599Z scale_ub=1200.0, 2025-05-07T20:32:25.2073692Z contiguous=False, 2025-05-07T20:32:25.2073777Z compiled=False, 2025-05-07T20:32:25.2073850Z ) [test body as above] 2025-05-07T20:32:25.2089762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2089770Z 2025-05-07T20:32:25.2089882Z moe/activation_test.py:117: 2025-05-07T20:32:25.2090017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2090119Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2090230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2090850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2090949Z
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2091318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2091542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2091893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2091989Z kernel = self.compile( 2025-05-07T20:32:25.2092425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2092654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2092826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2092834Z 2025-05-07T20:32:25.2093049Z self = 2025-05-07T20:32:25.2093830Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2094336Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efd8ea9c360>} 2025-05-07T20:32:25.2095088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2095287Z context = 2025-05-07T20:32:25.2095293Z 2025-05-07T20:32:25.2095474Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2095736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2095844Z module_map=module_map) 2025-05-07T20:32:25.2096019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2096121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2096208Z E ^ 2025-05-07T20:32:25.2096564Z E ValueError("type fp8e4nv not supported in this architecture. 
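Note: the failure above is the template for every example Hypothesis tries in this job. Triton rejects fp8e4nv (its name for the float8 e4m3 variant) at kernel-compile time because, per the error message, this runner's GPU only supports the fp8e4b15 and fp8e5 encodings. A minimal skip-guard sketch follows; it is hypothetical (not part of moe/activation_test.py) and assumes fp8e4nv codegen requires CUDA compute capability (8, 9) or newer:

# Hypothetical guard, not taken from the test file: skip fp8e4nv tests on
# GPUs where Triton cannot emit that dtype. The (8, 9) threshold (Ada/Hopper)
# is an assumption, not something stated in this log.
import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
    ...

With a guard like this, runs on this hardware would report skips instead of the repeated CompilationError traces below.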

2025-05-07T20:32:25.2097106Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False): fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] raises the same CompilationError ("type fp8e4nv not supported in this architecture"); traceback identical to the one above.
2025-05-07T20:32:25.2110297Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): same CompilationError, identical traceback.
2025-05-07T20:32:25.2123111Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True): same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (_fn) before reaching silu_mul_quant.
2025-05-07T20:32:25.2137804Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): same CompilationError, identical traceback.

2025-05-07T20:32:25.2151106Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test source identical to the first listing above, through the definition of fn()]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca3aa1440>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
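This example is the odd one out in the run: the fused fn() call returned, and the failure shifted to the reference path, ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, inside the Triton autotuner. The root cause is still the fp8e4nv cast. For orientation, here is a pure-PyTorch mirror of what ref_fn computes; it is hypothetical, and the exact scaling and clamping details of triton_quantize_fp8_row are assumed rather than taken from its source:

# Hypothetical pure-PyTorch mirror of ref_fn: y = silu(x0) * x1 in fp32,
# then row-wise quantization to fp8 e4m3. 448.0 is the e4m3fn finite max;
# the epsilon and scale_ub handling below are assumptions.
import torch

FP8_E4M3_MAX = 448.0

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # SiLU(x0) * x1 computed in float32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        # scale_ub arrives as a 1-element float32 tensor in the test.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / scale[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return y_fp8.to(torch.float8_e4m3fn), scale

The returned scale is the per-row dequant multiplier, matching how the test reconstructs y as y_fp8.to(torch.float32) * y_scale[:, None].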
2025-05-07T20:32:25.2173232Z op = torch.compile(op) 2025-05-07T20:32:25.2173346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2173420Z 2025-05-07T20:32:25.2173510Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2173521Z 2025-05-07T20:32:25.2173619Z moe/activation_test.py:117: 2025-05-07T20:32:25.2173802Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2173911Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2174009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2174374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2174471Z return fn(*args, **kwargs) 2025-05-07T20:32:25.2174960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2175062Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2175416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2175719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2176099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2176197Z kernel = self.compile( 2025-05-07T20:32:25.2176574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2176754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2176883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2176887Z 2025-05-07T20:32:25.2177098Z self = 2025-05-07T20:32:25.2177869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2178378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa2a20>} 2025-05-07T20:32:25.2179127Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2179317Z context = 2025-05-07T20:32:25.2179322Z 2025-05-07T20:32:25.2179490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2179749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2179855Z module_map=module_map) 2025-05-07T20:32:25.2180021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2180124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2180206Z E ^ 2025-05-07T20:32:25.2180561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2180566Z 2025-05-07T20:32:25.2180983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2180987Z 2025-05-07T20:32:25.2181095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2181317Z self=, 2025-05-07T20:32:25.2181401Z T=1, 2025-05-07T20:32:25.2181477Z D=5120, 2025-05-07T20:32:25.2181559Z scale_ub=1200.0, 2025-05-07T20:32:25.2181655Z contiguous=False, 2025-05-07T20:32:25.2181739Z compiled=False, 2025-05-07T20:32:25.2181811Z ) 2025-05-07T20:32:25.2182040Z self = 2025-05-07T20:32:25.2182212Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2182217Z 2025-05-07T20:32:25.2182292Z @given( 2025-05-07T20:32:25.2182418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2182569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2182692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2182808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2182920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2183000Z ) 2025-05-07T20:32:25.2183245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2183338Z def test_silu_mul_quant( 2025-05-07T20:32:25.2183423Z self, 2025-05-07T20:32:25.2183502Z T: int, 2025-05-07T20:32:25.2183578Z D: int, 2025-05-07T20:32:25.2183682Z scale_ub: Optional[float], 2025-05-07T20:32:25.2183814Z contiguous: bool, 2025-05-07T20:32:25.2183943Z compiled: bool, 2025-05-07T20:32:25.2184028Z ) -> None: 2025-05-07T20:32:25.2184123Z torch.manual_seed(2025) 2025-05-07T20:32:25.2184203Z 2025-05-07T20:32:25.2184411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2184493Z 2025-05-07T20:32:25.2184592Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2184719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2184807Z x = x_sign * x_clamp 2025-05-07T20:32:25.2184893Z x0 = x[:, :D] 2025-05-07T20:32:25.2184975Z x1 = x[:, D:] 2025-05-07T20:32:25.2185047Z 2025-05-07T20:32:25.2185141Z if contiguous: 2025-05-07T20:32:25.2185233Z x0 = x0.contiguous() 2025-05-07T20:32:25.2185323Z x1 = x1.contiguous() 2025-05-07T20:32:25.2185401Z 2025-05-07T20:32:25.2185491Z if scale_ub is not None: 2025-05-07T20:32:25.2185600Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2185743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2185821Z ) 2025-05-07T20:32:25.2185903Z else: 2025-05-07T20:32:25.2185996Z scale_ub_tensor = None 2025-05-07T20:32:25.2186070Z 2025-05-07T20:32:25.2186211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2186301Z op = silu_mul_quant 2025-05-07T20:32:25.2186386Z if compiled: 2025-05-07T20:32:25.2186494Z op = torch.compile(op) 2025-05-07T20:32:25.2186598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2186670Z 2025-05-07T20:32:25.2186769Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2186773Z 2025-05-07T20:32:25.2186870Z moe/activation_test.py:117: 2025-05-07T20:32:25.2187007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2187108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2187206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2187711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2187807Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2188167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2188393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2188730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2188828Z kernel = self.compile( 2025-05-07T20:32:25.2189376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2189549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2189682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2189691Z 2025-05-07T20:32:25.2189895Z self = 2025-05-07T20:32:25.2190778Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2191280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3aa31a0>} 2025-05-07T20:32:25.2192022Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2192218Z context = 2025-05-07T20:32:25.2192223Z 2025-05-07T20:32:25.2192537Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2192841Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2192985Z module_map=module_map) 2025-05-07T20:32:25.2193151Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2193254Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2193332Z E ^ 2025-05-07T20:32:25.2193685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2193696Z 2025-05-07T20:32:25.2194106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2194110Z 2025-05-07T20:32:25.2194212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2194440Z self=, 2025-05-07T20:32:25.2194520Z T=16384, 2025-05-07T20:32:25.2194599Z D=5120, 2025-05-07T20:32:25.2194690Z scale_ub=1200.0, 2025-05-07T20:32:25.2194778Z contiguous=False, 2025-05-07T20:32:25.2194861Z compiled=True, 2025-05-07T20:32:25.2194941Z ) 2025-05-07T20:32:25.2195162Z self = 2025-05-07T20:32:25.2195343Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2195347Z 2025-05-07T20:32:25.2195423Z @given( 2025-05-07T20:32:25.2195540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2195645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2195760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2195874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2195992Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2196066Z ) 2025-05-07T20:32:25.2196316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2196414Z def test_silu_mul_quant( 2025-05-07T20:32:25.2196490Z self, 2025-05-07T20:32:25.2196573Z T: int, 2025-05-07T20:32:25.2196650Z D: int, 2025-05-07T20:32:25.2196753Z scale_ub: Optional[float], 2025-05-07T20:32:25.2196851Z contiguous: bool, 2025-05-07T20:32:25.2196937Z compiled: bool, 2025-05-07T20:32:25.2197014Z ) -> None: 2025-05-07T20:32:25.2197115Z torch.manual_seed(2025) 2025-05-07T20:32:25.2197192Z 2025-05-07T20:32:25.2197359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2197440Z 2025-05-07T20:32:25.2197532Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2197657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2197752Z x = x_sign * x_clamp 2025-05-07T20:32:25.2197833Z x0 = x[:, :D] 2025-05-07T20:32:25.2197922Z x1 = x[:, D:] 2025-05-07T20:32:25.2197997Z 2025-05-07T20:32:25.2198082Z if contiguous: 2025-05-07T20:32:25.2198182Z x0 = x0.contiguous() 2025-05-07T20:32:25.2198271Z x1 = x1.contiguous() 2025-05-07T20:32:25.2198343Z 2025-05-07T20:32:25.2198439Z if scale_ub is not None: 2025-05-07T20:32:25.2198595Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2198731Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2198816Z ) 2025-05-07T20:32:25.2198891Z else: 2025-05-07T20:32:25.2198985Z scale_ub_tensor = None 2025-05-07T20:32:25.2199066Z 2025-05-07T20:32:25.2199194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2199290Z op = silu_mul_quant 2025-05-07T20:32:25.2199375Z if compiled: 2025-05-07T20:32:25.2199474Z op = torch.compile(op) 2025-05-07T20:32:25.2199589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2199662Z 2025-05-07T20:32:25.2199806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2199895Z 2025-05-07T20:32:25.2200013Z moe/activation_test.py:117: 2025-05-07T20:32:25.2200166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2200304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2200414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2200779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2200880Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2201367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2201463Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2201822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2202041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2202380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2202479Z kernel = self.compile( 2025-05-07T20:32:25.2202860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2203036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2203163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2203168Z 2025-05-07T20:32:25.2203370Z self = 2025-05-07T20:32:25.2204143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2204642Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3308ea0>} 2025-05-07T20:32:25.2205395Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2205586Z context = 2025-05-07T20:32:25.2205590Z 2025-05-07T20:32:25.2205758Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2206018Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2206126Z module_map=module_map) 2025-05-07T20:32:25.2206291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2206390Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2206469Z E ^ 2025-05-07T20:32:25.2206831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2206835Z 2025-05-07T20:32:25.2207294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2207299Z 2025-05-07T20:32:25.2207407Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2207628Z self=, 2025-05-07T20:32:25.2207704Z T=2048, 2025-05-07T20:32:25.2207787Z D=7168, 2025-05-07T20:32:25.2207870Z scale_ub=1200.0, 2025-05-07T20:32:25.2207956Z contiguous=False, 2025-05-07T20:32:25.2208045Z compiled=True, 2025-05-07T20:32:25.2208118Z ) 2025-05-07T20:32:25.2208333Z self = 2025-05-07T20:32:25.2208512Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2208559Z 2025-05-07T20:32:25.2208674Z @given( 2025-05-07T20:32:25.2208797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2208896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2209048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2209172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2209285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2209358Z ) 2025-05-07T20:32:25.2209606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2209698Z def test_silu_mul_quant( 2025-05-07T20:32:25.2209779Z self, 2025-05-07T20:32:25.2209854Z T: int, 2025-05-07T20:32:25.2209930Z D: int, 2025-05-07T20:32:25.2210034Z scale_ub: Optional[float], 2025-05-07T20:32:25.2210121Z contiguous: bool, 2025-05-07T20:32:25.2210206Z compiled: bool, 2025-05-07T20:32:25.2210287Z ) -> None: 2025-05-07T20:32:25.2210385Z torch.manual_seed(2025) 2025-05-07T20:32:25.2210458Z 2025-05-07T20:32:25.2210631Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2210703Z 2025-05-07T20:32:25.2210792Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2210928Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2211017Z x = x_sign * x_clamp 2025-05-07T20:32:25.2211097Z x0 = x[:, :D] 2025-05-07T20:32:25.2211182Z x1 = x[:, D:] 2025-05-07T20:32:25.2211254Z 2025-05-07T20:32:25.2215724Z if contiguous: 2025-05-07T20:32:25.2215837Z x0 = x0.contiguous() 2025-05-07T20:32:25.2215941Z x1 = x1.contiguous() 2025-05-07T20:32:25.2216016Z 2025-05-07T20:32:25.2216108Z if scale_ub is not None: 2025-05-07T20:32:25.2216227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2216370Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2216460Z ) 2025-05-07T20:32:25.2216543Z else: 2025-05-07T20:32:25.2216638Z scale_ub_tensor = None 2025-05-07T20:32:25.2216723Z 2025-05-07T20:32:25.2216859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2216955Z op = silu_mul_quant 2025-05-07T20:32:25.2217052Z if compiled: 2025-05-07T20:32:25.2217156Z op = torch.compile(op) 2025-05-07T20:32:25.2217263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2217347Z 2025-05-07T20:32:25.2217440Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2217445Z 2025-05-07T20:32:25.2217544Z moe/activation_test.py:117: 2025-05-07T20:32:25.2217686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2217790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2217899Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2218273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2218373Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2218875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.2219053Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.2219416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.2219650Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.2220041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.2220146Z     kernel = self.compile(
2025-05-07T20:32:25.2220528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.2220704Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.2220896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:25.2220942Z 
2025-05-07T20:32:25.2221153Z self = <...>
2025-05-07T20:32:25.2221983Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.2222488Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca33099e0>}
2025-05-07T20:32:25.2223232Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.2223435Z context = <...>
2025-05-07T20:32:25.2223444Z 
2025-05-07T20:32:25.2223610Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.2223882Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.2223995Z                            module_map=module_map)
2025-05-07T20:32:25.2224157Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.2224264Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.2224342Z E       ^
2025-05-07T20:32:25.2224704Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.2224710Z 
2025-05-07T20:32:25.2225123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.2225128Z 
2025-05-07T20:32:25.2225230Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.2225462Z     self=<...>,
2025-05-07T20:32:25.2225545Z     T=1,
2025-05-07T20:32:25.2225623Z     D=5120,
2025-05-07T20:32:25.2225716Z     scale_ub=None,
2025-05-07T20:32:25.2225805Z     contiguous=False,
2025-05-07T20:32:25.2225899Z     compiled=False,
2025-05-07T20:32:25.2225975Z )
2025-05-07T20:32:25.2226195Z self = <...>
2025-05-07T20:32:25.2226369Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.2226374Z 
2025-05-07T20:32:25.2226452Z     @given(
2025-05-07T20:32:25.2226575Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:25.2226687Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:25.2226802Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:25.2226920Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:25.2227039Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:25.2227116Z     )
2025-05-07T20:32:25.2227370Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:25.2227468Z     def test_silu_mul_quant(
2025-05-07T20:32:25.2227544Z         self,
2025-05-07T20:32:25.2227632Z         T: int,
2025-05-07T20:32:25.2227712Z         D: int,
2025-05-07T20:32:25.2227861Z         scale_ub: Optional[float],
2025-05-07T20:32:25.2227962Z         contiguous: bool,
2025-05-07T20:32:25.2228048Z         compiled: bool,
2025-05-07T20:32:25.2228131Z     ) -> None:
2025-05-07T20:32:25.2228596Z         torch.manual_seed(2025)
2025-05-07T20:32:25.2228704Z 
2025-05-07T20:32:25.2228916Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:25.2229000Z 
2025-05-07T20:32:25.2229148Z         x_sign = torch.sign(x)
2025-05-07T20:32:25.2229282Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:25.2229373Z         x = x_sign * x_clamp
2025-05-07T20:32:25.2229457Z         x0 = x[:, :D]
2025-05-07T20:32:25.2229715Z         x1 = x[:, D:]
2025-05-07T20:32:25.2229863Z 
2025-05-07T20:32:25.2229969Z         if contiguous:
2025-05-07T20:32:25.2230074Z             x0 = x0.contiguous()
2025-05-07T20:32:25.2230178Z             x1 = x1.contiguous()
2025-05-07T20:32:25.2230313Z 
2025-05-07T20:32:25.2230417Z         if scale_ub is not None:
2025-05-07T20:32:25.2230525Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:25.2230660Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:25.2230742Z             )
2025-05-07T20:32:25.2230818Z         else:
2025-05-07T20:32:25.2230917Z             scale_ub_tensor = None
2025-05-07T20:32:25.2230994Z 
2025-05-07T20:32:25.2231125Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:25.2231222Z             op = silu_mul_quant
2025-05-07T20:32:25.2231310Z             if compiled:
2025-05-07T20:32:25.2231412Z                 op = torch.compile(op)
2025-05-07T20:32:25.2231528Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.2231608Z 
2025-05-07T20:32:25.2231701Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.2231706Z 
2025-05-07T20:32:25.2231811Z moe/activation_test.py:117: 
2025-05-07T20:32:25.2231942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.2232047Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.2232155Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.2232652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.2232759Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.2233117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:25.2233339Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.2233687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.2233787Z     kernel = self.compile(
2025-05-07T20:32:25.2234175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.2234350Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.2234478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.2234483Z 
2025-05-07T20:32:25.2234697Z self = <...>
2025-05-07T20:32:25.2235469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.2235978Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7efca330ad40>}
2025-05-07T20:32:25.2236727Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:25.2237007Z context = <...>
2025-05-07T20:32:25.2237013Z 
2025-05-07T20:32:25.2237213Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.2237561Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.2237683Z                            module_map=module_map)
2025-05-07T20:32:25.2237845Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.2237944Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.2238031Z E       ^
2025-05-07T20:32:25.2238386Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.2238521Z 
2025-05-07T20:32:25.2238943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.2238947Z 
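Note on the failure mode: every drawn example in this run dies at the same point, with Triton refusing to lower the fp8e4nv (FP8 E4M3) type while compiling _fbgemm_silu_mul_quant. The supported list in the error, ('fp8e4b15', 'fp8e5'), is what Triton offers on this runner's GPU; the g5 runner carries an A10G (compute capability 8.6), and fp8e4nv generally requires compute capability 8.9 or newer (Ada/Hopper). A minimal skip-guard sketch follows; the helper and test-class names are hypothetical, not from the FBGEMM sources:

    # Hypothetical guard: skip FP8 E4M3 tests on GPUs that Triton cannot
    # compile fp8e4nv for (assumption: support starts at compute capability 8.9).
    import unittest

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # get_device_capability() returns a (major, minor) tuple for the current device.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With such a guard the run below would report these cases as skipped on the A10G instead of failing the job.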
The next ten drawn examples fail identically while Triton compiles _fbgemm_silu_mul_quant; one line per example:
2025-05-07T20:32:25.2239090Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2251877Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2265224Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2278492Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2291345Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2304574Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2317310Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2331202Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2348886Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:25.2361887Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> triton.compiler.errors.CompilationError (ValueError: type fp8e4nv not supported in this architecture)
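The @given grid above samples 5 x 2 x 2 x 2 x 2 = 80 parameter combinations (capped by max_examples=_MAX_SAMPLES), but the parameters never matter here: compilation fails before the kernel runs, so one direct call reproduces the error. A minimal repro sketch; the import path is inferred from the traceback and assumes a CUDA device plus the fbgemm_gpu genai build:

    # On a GPU without fp8e4nv support this raises
    # triton.compiler.errors.CompilationError, matching the log above.
    import torch
    from fbgemm_gpu.experimental.gen_ai.moe import silu_mul_quant  # path inferred

    T, D = 1, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)  # scale_ub=None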
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2374769Z 2025-05-07T20:32:25.2375190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2375195Z 2025-05-07T20:32:25.2375297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2375520Z self=, 2025-05-07T20:32:25.2375606Z T=4096, 2025-05-07T20:32:25.2375684Z D=7168, 2025-05-07T20:32:25.2375768Z scale_ub=None, 2025-05-07T20:32:25.2375864Z contiguous=False, 2025-05-07T20:32:25.2375949Z compiled=True, 2025-05-07T20:32:25.2376025Z ) 2025-05-07T20:32:25.2376253Z self = 2025-05-07T20:32:25.2376424Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2376428Z 2025-05-07T20:32:25.2376518Z @given( 2025-05-07T20:32:25.2376640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2376742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2376866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2376983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2377096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2377180Z ) 2025-05-07T20:32:25.2377423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2377527Z def test_silu_mul_quant( 2025-05-07T20:32:25.2377605Z self, 2025-05-07T20:32:25.2377684Z T: int, 2025-05-07T20:32:25.2377770Z D: int, 2025-05-07T20:32:25.2377871Z scale_ub: Optional[float], 2025-05-07T20:32:25.2377964Z contiguous: bool, 2025-05-07T20:32:25.2378059Z compiled: bool, 2025-05-07T20:32:25.2378139Z ) -> None: 2025-05-07T20:32:25.2378235Z torch.manual_seed(2025) 2025-05-07T20:32:25.2378321Z 2025-05-07T20:32:25.2378492Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2378571Z 2025-05-07T20:32:25.2378673Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2378799Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2378890Z x = x_sign * x_clamp 2025-05-07T20:32:25.2378982Z x0 = x[:, :D] 2025-05-07T20:32:25.2379066Z x1 = x[:, D:] 2025-05-07T20:32:25.2379148Z 2025-05-07T20:32:25.2379233Z if contiguous: 2025-05-07T20:32:25.2379326Z x0 = x0.contiguous() 2025-05-07T20:32:25.2379425Z x1 = x1.contiguous() 2025-05-07T20:32:25.2379497Z 2025-05-07T20:32:25.2379591Z if scale_ub is not None: 2025-05-07T20:32:25.2379710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2379844Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2379920Z ) 2025-05-07T20:32:25.2380007Z else: 2025-05-07T20:32:25.2380153Z scale_ub_tensor = None 2025-05-07T20:32:25.2380227Z 2025-05-07T20:32:25.2380365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2380458Z op = silu_mul_quant 2025-05-07T20:32:25.2380550Z if compiled: 2025-05-07T20:32:25.2380667Z op = torch.compile(op) 2025-05-07T20:32:25.2380784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2380879Z 2025-05-07T20:32:25.2380972Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2380977Z 2025-05-07T20:32:25.2381080Z moe/activation_test.py:117: 2025-05-07T20:32:25.2381209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2381351Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2381493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2381858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2381989Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2382491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2382588Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2382948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2383167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2383502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2383604Z kernel = self.compile( 2025-05-07T20:32:25.2383987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2384162Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2384301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2384306Z 2025-05-07T20:32:25.2384510Z self = 2025-05-07T20:32:25.2385289Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2385788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca363a700>} 2025-05-07T20:32:25.2386532Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2386727Z context = 2025-05-07T20:32:25.2386734Z 2025-05-07T20:32:25.2386897Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2387165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2387271Z module_map=module_map) 2025-05-07T20:32:25.2387437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2387534Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2387613Z E ^ 2025-05-07T20:32:25.2387975Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2387980Z 2025-05-07T20:32:25.2388392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2388401Z 2025-05-07T20:32:25.2388502Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2388729Z self=, 2025-05-07T20:32:25.2388852Z T=16384, 2025-05-07T20:32:25.2388936Z D=5120, 2025-05-07T20:32:25.2389019Z scale_ub=1200.0, 2025-05-07T20:32:25.2389223Z contiguous=False, 2025-05-07T20:32:25.2389313Z compiled=False, 2025-05-07T20:32:25.2389384Z ) 2025-05-07T20:32:25.2389602Z self = 2025-05-07T20:32:25.2389785Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2389789Z 2025-05-07T20:32:25.2389865Z @given( 2025-05-07T20:32:25.2389982Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2390090Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2390249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2390410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2390525Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2390598Z ) 2025-05-07T20:32:25.2390887Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2390981Z def test_silu_mul_quant( 2025-05-07T20:32:25.2391058Z self, 2025-05-07T20:32:25.2391140Z T: int, 2025-05-07T20:32:25.2391217Z D: int, 2025-05-07T20:32:25.2391314Z scale_ub: Optional[float], 2025-05-07T20:32:25.2391409Z contiguous: bool, 2025-05-07T20:32:25.2391495Z compiled: bool, 2025-05-07T20:32:25.2391573Z ) -> None: 2025-05-07T20:32:25.2391672Z torch.manual_seed(2025) 2025-05-07T20:32:25.2391744Z 2025-05-07T20:32:25.2391917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2391991Z 2025-05-07T20:32:25.2392085Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2392216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2392305Z x = x_sign * x_clamp 2025-05-07T20:32:25.2392385Z x0 = x[:, :D] 2025-05-07T20:32:25.2392471Z x1 = x[:, D:] 2025-05-07T20:32:25.2392546Z 2025-05-07T20:32:25.2392631Z if contiguous: 2025-05-07T20:32:25.2392728Z x0 = x0.contiguous() 2025-05-07T20:32:25.2392816Z x1 = x1.contiguous() 2025-05-07T20:32:25.2392888Z 2025-05-07T20:32:25.2392983Z if scale_ub is not None: 2025-05-07T20:32:25.2393088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2393225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2393301Z ) 2025-05-07T20:32:25.2393379Z else: 2025-05-07T20:32:25.2393476Z scale_ub_tensor = None 2025-05-07T20:32:25.2393548Z 2025-05-07T20:32:25.2393676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2393774Z op = silu_mul_quant 2025-05-07T20:32:25.2393864Z if compiled: 2025-05-07T20:32:25.2393963Z op = torch.compile(op) 2025-05-07T20:32:25.2394075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2394154Z 2025-05-07T20:32:25.2394249Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2394260Z 2025-05-07T20:32:25.2394357Z moe/activation_test.py:117: 2025-05-07T20:32:25.2394487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2394593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2394694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2395186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:25.2395290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2395648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2395873Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2396221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2396396Z kernel = self.compile( 2025-05-07T20:32:25.2396782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2396953Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2397082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2397086Z 2025-05-07T20:32:25.2397296Z self = 2025-05-07T20:32:25.2398061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2398646Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca363b060>} 2025-05-07T20:32:25.2399429Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2399652Z context = 2025-05-07T20:32:25.2399656Z 2025-05-07T20:32:25.2399845Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2400104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2400217Z module_map=module_map) 2025-05-07T20:32:25.2400377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2400481Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2400563Z E ^ 2025-05-07T20:32:25.2400918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2400925Z 2025-05-07T20:32:25.2401348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2401352Z 2025-05-07T20:32:25.2401456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2401679Z self=, 2025-05-07T20:32:25.2401764Z T=16384, 2025-05-07T20:32:25.2401840Z D=5120, 2025-05-07T20:32:25.2401923Z scale_ub=1200.0, 2025-05-07T20:32:25.2402016Z contiguous=True, 2025-05-07T20:32:25.2402100Z compiled=True, 2025-05-07T20:32:25.2402177Z ) 2025-05-07T20:32:25.2402392Z self = 2025-05-07T20:32:25.2402565Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2402573Z 2025-05-07T20:32:25.2402656Z @given( 2025-05-07T20:32:25.2402773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2402874Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2402997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2403112Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2403223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2403304Z ) 2025-05-07T20:32:25.2403546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2403644Z def test_silu_mul_quant( 2025-05-07T20:32:25.2403733Z self, 2025-05-07T20:32:25.2403817Z T: int, 2025-05-07T20:32:25.2403895Z D: int, 2025-05-07T20:32:25.2403993Z scale_ub: Optional[float], 2025-05-07T20:32:25.2404091Z contiguous: bool, 2025-05-07T20:32:25.2404181Z compiled: bool, 2025-05-07T20:32:25.2404263Z ) -> None: 2025-05-07T20:32:25.2404364Z torch.manual_seed(2025) 2025-05-07T20:32:25.2404437Z 2025-05-07T20:32:25.2404606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2404686Z 2025-05-07T20:32:25.2404829Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2404954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2405051Z x = x_sign * x_clamp 2025-05-07T20:32:25.2405132Z x0 = x[:, :D] 2025-05-07T20:32:25.2405218Z x1 = x[:, D:] 2025-05-07T20:32:25.2405289Z 2025-05-07T20:32:25.2405373Z if contiguous: 2025-05-07T20:32:25.2405472Z x0 = x0.contiguous() 2025-05-07T20:32:25.2405560Z x1 = x1.contiguous() 2025-05-07T20:32:25.2405633Z 2025-05-07T20:32:25.2405729Z if scale_ub is not None: 2025-05-07T20:32:25.2405834Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2406012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2406134Z ) 2025-05-07T20:32:25.2406209Z else: 2025-05-07T20:32:25.2406303Z scale_ub_tensor = None 2025-05-07T20:32:25.2406385Z 2025-05-07T20:32:25.2406554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2406649Z op = silu_mul_quant 2025-05-07T20:32:25.2406740Z if compiled: 2025-05-07T20:32:25.2406840Z op = torch.compile(op) 2025-05-07T20:32:25.2406950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2407022Z 2025-05-07T20:32:25.2407111Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2407115Z 2025-05-07T20:32:25.2407218Z moe/activation_test.py:117: 2025-05-07T20:32:25.2407346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2407445Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2407549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2407914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2408014Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2408510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2408606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2408964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2409181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2409516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2409622Z kernel = self.compile( 2025-05-07T20:32:25.2410048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2410227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2410357Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2410362Z 2025-05-07T20:32:25.2410567Z self = 2025-05-07T20:32:25.2411345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2411841Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d11c0>} 2025-05-07T20:32:25.2412586Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2412779Z context = 2025-05-07T20:32:25.2412783Z 2025-05-07T20:32:25.2412944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2413255Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2413363Z module_map=module_map) 2025-05-07T20:32:25.2413528Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2413626Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2413702Z E ^ 2025-05-07T20:32:25.2414065Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2414070Z 2025-05-07T20:32:25.2414483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2414529Z 2025-05-07T20:32:25.2414639Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2414898Z self=, 2025-05-07T20:32:25.2414976Z T=16384, 2025-05-07T20:32:25.2415060Z D=5120, 2025-05-07T20:32:25.2415184Z scale_ub=None, 2025-05-07T20:32:25.2415276Z contiguous=False, 2025-05-07T20:32:25.2415366Z compiled=True, 2025-05-07T20:32:25.2415438Z ) 2025-05-07T20:32:25.2415653Z self = 2025-05-07T20:32:25.2415832Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2415836Z 2025-05-07T20:32:25.2415914Z @given( 2025-05-07T20:32:25.2416035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2416134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2416247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2416369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2416488Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2416564Z ) 2025-05-07T20:32:25.2416813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2416911Z def test_silu_mul_quant( 2025-05-07T20:32:25.2416989Z self, 2025-05-07T20:32:25.2417073Z T: int, 2025-05-07T20:32:25.2417150Z D: int, 2025-05-07T20:32:25.2417253Z scale_ub: Optional[float], 2025-05-07T20:32:25.2417341Z contiguous: bool, 2025-05-07T20:32:25.2417426Z compiled: bool, 2025-05-07T20:32:25.2417509Z ) -> None: 2025-05-07T20:32:25.2417602Z torch.manual_seed(2025) 2025-05-07T20:32:25.2417675Z 2025-05-07T20:32:25.2417845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2417919Z 2025-05-07T20:32:25.2418012Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2418141Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2418233Z x = x_sign * x_clamp 2025-05-07T20:32:25.2418319Z x0 = x[:, :D] 2025-05-07T20:32:25.2418408Z x1 = x[:, D:] 2025-05-07T20:32:25.2418480Z 2025-05-07T20:32:25.2418565Z if contiguous: 2025-05-07T20:32:25.2418668Z x0 = x0.contiguous() 2025-05-07T20:32:25.2418759Z x1 = x1.contiguous() 2025-05-07T20:32:25.2418839Z 2025-05-07T20:32:25.2418929Z if scale_ub is not None: 2025-05-07T20:32:25.2419035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2419176Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2419252Z ) 2025-05-07T20:32:25.2419329Z else: 2025-05-07T20:32:25.2419429Z scale_ub_tensor = None 2025-05-07T20:32:25.2419502Z 2025-05-07T20:32:25.2419629Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2419727Z op = silu_mul_quant 2025-05-07T20:32:25.2419815Z if compiled: 2025-05-07T20:32:25.2419915Z op = torch.compile(op) 2025-05-07T20:32:25.2420029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2420102Z 2025-05-07T20:32:25.2420199Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2420203Z 2025-05-07T20:32:25.2420302Z moe/activation_test.py:117: 2025-05-07T20:32:25.2420487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2420597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2420695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2421058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2421156Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2421642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2421743Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2422094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2422394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2422813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2422909Z kernel = self.compile( 2025-05-07T20:32:25.2423286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2423464Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2423591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2423596Z 2025-05-07T20:32:25.2423804Z self = 2025-05-07T20:32:25.2424572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2425089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1d00>} 2025-05-07T20:32:25.2425828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2426017Z context = 2025-05-07T20:32:25.2426021Z 2025-05-07T20:32:25.2426189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2426445Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2426557Z module_map=module_map) 2025-05-07T20:32:25.2426719Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2426820Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2426902Z E ^ 2025-05-07T20:32:25.2427259Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2427263Z 2025-05-07T20:32:25.2427680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2427690Z 2025-05-07T20:32:25.2427791Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2428012Z self=, 2025-05-07T20:32:25.2428095Z T=2048, 2025-05-07T20:32:25.2428908Z D=5120, 2025-05-07T20:32:25.2429002Z scale_ub=None, 2025-05-07T20:32:25.2429140Z contiguous=False, 2025-05-07T20:32:25.2429226Z compiled=True, 2025-05-07T20:32:25.2429299Z ) 2025-05-07T20:32:25.2429521Z self = 2025-05-07T20:32:25.2429728Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2429733Z 2025-05-07T20:32:25.2429822Z @given( 2025-05-07T20:32:25.2429964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2430251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2430377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2430494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2430607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2430687Z ) 2025-05-07T20:32:25.2430932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2431026Z def test_silu_mul_quant( 2025-05-07T20:32:25.2431110Z self, 2025-05-07T20:32:25.2431186Z T: int, 2025-05-07T20:32:25.2431265Z D: int, 2025-05-07T20:32:25.2431367Z scale_ub: Optional[float], 2025-05-07T20:32:25.2431601Z contiguous: bool, 2025-05-07T20:32:25.2431693Z compiled: bool, 2025-05-07T20:32:25.2431774Z ) -> None: 2025-05-07T20:32:25.2431871Z torch.manual_seed(2025) 2025-05-07T20:32:25.2431950Z 2025-05-07T20:32:25.2432180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2432253Z 2025-05-07T20:32:25.2432351Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2432473Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2432562Z x = x_sign * x_clamp 2025-05-07T20:32:25.2432647Z x0 = x[:, :D] 2025-05-07T20:32:25.2432728Z x1 = x[:, D:] 2025-05-07T20:32:25.2432800Z 2025-05-07T20:32:25.2432891Z if contiguous: 2025-05-07T20:32:25.2432983Z x0 = x0.contiguous() 2025-05-07T20:32:25.2433071Z x1 = x1.contiguous() 2025-05-07T20:32:25.2433148Z 2025-05-07T20:32:25.2433237Z if scale_ub is not None: 2025-05-07T20:32:25.2433354Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2433491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2433566Z ) 2025-05-07T20:32:25.2433648Z else: 2025-05-07T20:32:25.2433744Z scale_ub_tensor = None 2025-05-07T20:32:25.2433819Z 2025-05-07T20:32:25.2433956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2434049Z op = silu_mul_quant 2025-05-07T20:32:25.2434136Z if compiled: 2025-05-07T20:32:25.2434244Z op = torch.compile(op) 2025-05-07T20:32:25.2434348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2434420Z 2025-05-07T20:32:25.2434517Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2434522Z 2025-05-07T20:32:25.2434619Z moe/activation_test.py:117: 2025-05-07T20:32:25.2434759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2434860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2434961Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2435339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2435432Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2435925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2436031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2436385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2436611Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2436948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2437042Z kernel = self.compile( 2025-05-07T20:32:25.2437426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2437606Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2437742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2437749Z 2025-05-07T20:32:25.2438001Z self = 2025-05-07T20:32:25.2438772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2439279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca35d1620>} 2025-05-07T20:32:25.2440048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2440347Z context = 2025-05-07T20:32:25.2440352Z 2025-05-07T20:32:25.2440553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2440813Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2440927Z module_map=module_map) 2025-05-07T20:32:25.2441086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2441191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2441268Z E ^ 2025-05-07T20:32:25.2441618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2441622Z 2025-05-07T20:32:25.2442038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2442047Z 2025-05-07T20:32:25.2442150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2442375Z self=, 2025-05-07T20:32:25.2442454Z T=2048, 2025-05-07T20:32:25.2442531Z D=5120, 2025-05-07T20:32:25.2442623Z scale_ub=1200.0, 2025-05-07T20:32:25.2442712Z contiguous=False, 2025-05-07T20:32:25.2442795Z compiled=True, 2025-05-07T20:32:25.2442876Z ) 2025-05-07T20:32:25.2443093Z self = 2025-05-07T20:32:25.2443263Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2443267Z 2025-05-07T20:32:25.2443351Z @given( 2025-05-07T20:32:25.2443469Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2443574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2443690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2443808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2443934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2444008Z ) 2025-05-07T20:32:25.2444253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2444356Z def test_silu_mul_quant( 2025-05-07T20:32:25.2444432Z self, 2025-05-07T20:32:25.2444512Z T: int, 2025-05-07T20:32:25.2444595Z D: int, 2025-05-07T20:32:25.2444692Z scale_ub: Optional[float], 2025-05-07T20:32:25.2444785Z contiguous: bool, 2025-05-07T20:32:25.2444876Z compiled: bool, 2025-05-07T20:32:25.2444956Z ) -> None: 2025-05-07T20:32:25.2445057Z torch.manual_seed(2025) 2025-05-07T20:32:25.2445131Z 2025-05-07T20:32:25.2445299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2445379Z 2025-05-07T20:32:25.2445474Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2445600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2445702Z x = x_sign * x_clamp 2025-05-07T20:32:25.2445783Z x0 = x[:, :D] 2025-05-07T20:32:25.2445862Z x1 = x[:, D:] 2025-05-07T20:32:25.2445942Z 2025-05-07T20:32:25.2446030Z if contiguous: 2025-05-07T20:32:25.2446171Z x0 = x0.contiguous() 2025-05-07T20:32:25.2446270Z x1 = x1.contiguous() 2025-05-07T20:32:25.2446343Z 2025-05-07T20:32:25.2446432Z if scale_ub is not None: 2025-05-07T20:32:25.2446545Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2446680Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2446763Z ) 2025-05-07T20:32:25.2446840Z else: 2025-05-07T20:32:25.2446936Z scale_ub_tensor = None 2025-05-07T20:32:25.2447018Z 2025-05-07T20:32:25.2447148Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2447240Z op = silu_mul_quant 2025-05-07T20:32:25.2447378Z if compiled: 2025-05-07T20:32:25.2447516Z op = torch.compile(op) 2025-05-07T20:32:25.2447622Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2447702Z 2025-05-07T20:32:25.2447830Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2447835Z 2025-05-07T20:32:25.2447944Z moe/activation_test.py:117: 2025-05-07T20:32:25.2448078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2448179Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2448284Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2448650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2448743Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2449239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2449340Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2449702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2449921Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2450260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2450365Z kernel = self.compile( 2025-05-07T20:32:25.2450742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2450917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2451051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2451055Z 2025-05-07T20:32:25.2451259Z self = 2025-05-07T20:32:25.2452032Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2452546Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca34905e0>} 2025-05-07T20:32:25.2453293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2453483Z context = 2025-05-07T20:32:25.2453487Z 2025-05-07T20:32:25.2453650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2453919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2454028Z module_map=module_map) 2025-05-07T20:32:25.2454192Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2454298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2454376Z E ^ 2025-05-07T20:32:25.2454781Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2454787Z 2025-05-07T20:32:25.2455199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2455203Z 2025-05-07T20:32:25.2455306Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2455532Z self=, 2025-05-07T20:32:25.2455614Z T=4096, 2025-05-07T20:32:25.2455700Z D=5120, 2025-05-07T20:32:25.2455784Z scale_ub=1200.0, 2025-05-07T20:32:25.2455871Z contiguous=True, 2025-05-07T20:32:25.2455959Z compiled=True, 2025-05-07T20:32:25.2456073Z ) 2025-05-07T20:32:25.2456354Z self = 2025-05-07T20:32:25.2456530Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2456535Z 2025-05-07T20:32:25.2456749Z @given( 2025-05-07T20:32:25.2456872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2456977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2457091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2457212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2457324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2457399Z ) 2025-05-07T20:32:25.2457647Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2457741Z def test_silu_mul_quant( 2025-05-07T20:32:25.2457818Z self, 2025-05-07T20:32:25.2457901Z T: int, 2025-05-07T20:32:25.2457983Z D: int, 2025-05-07T20:32:25.2458084Z scale_ub: Optional[float], 2025-05-07T20:32:25.2458185Z contiguous: bool, 2025-05-07T20:32:25.2458271Z compiled: bool, 2025-05-07T20:32:25.2458353Z ) -> None: 2025-05-07T20:32:25.2458461Z torch.manual_seed(2025) 2025-05-07T20:32:25.2458537Z 2025-05-07T20:32:25.2458713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2458787Z 2025-05-07T20:32:25.2458880Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2459010Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2459100Z x = x_sign * x_clamp 2025-05-07T20:32:25.2459180Z x0 = x[:, :D] 2025-05-07T20:32:25.2459266Z x1 = x[:, D:] 2025-05-07T20:32:25.2459339Z 2025-05-07T20:32:25.2459423Z if contiguous: 2025-05-07T20:32:25.2459524Z x0 = x0.contiguous() 2025-05-07T20:32:25.2459638Z x1 = x1.contiguous() 2025-05-07T20:32:25.2459714Z 2025-05-07T20:32:25.2459838Z if scale_ub is not None: 2025-05-07T20:32:25.2459951Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2460086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2460168Z ) 2025-05-07T20:32:25.2460247Z else: 2025-05-07T20:32:25.2460347Z scale_ub_tensor = None 2025-05-07T20:32:25.2460420Z 2025-05-07T20:32:25.2460548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2460643Z op = silu_mul_quant 2025-05-07T20:32:25.2460729Z if compiled: 2025-05-07T20:32:25.2460831Z op = torch.compile(op) 2025-05-07T20:32:25.2460944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2461017Z 2025-05-07T20:32:25.2461109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2461113Z 2025-05-07T20:32:25.2461217Z moe/activation_test.py:117: 2025-05-07T20:32:25.2461344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2461457Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2461557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2461923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2462075Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2462563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2462661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2463019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2463237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2467642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2467758Z kernel = self.compile( 2025-05-07T20:32:25.2468238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2468455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2468639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2468644Z 2025-05-07T20:32:25.2468854Z self = 2025-05-07T20:32:25.2469732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2470247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3491120>} 2025-05-07T20:32:25.2471045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2471249Z context = 2025-05-07T20:32:25.2471256Z 2025-05-07T20:32:25.2471423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2471685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2471802Z module_map=module_map) 2025-05-07T20:32:25.2471964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2472075Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2472155Z E ^ 2025-05-07T20:32:25.2472509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2472514Z 2025-05-07T20:32:25.2472936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2472945Z 2025-05-07T20:32:25.2473052Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2473286Z self=, 2025-05-07T20:32:25.2473369Z T=128, 2025-05-07T20:32:25.2473449Z D=5120, 2025-05-07T20:32:25.2473544Z scale_ub=1200.0, 2025-05-07T20:32:25.2473634Z contiguous=False, 2025-05-07T20:32:25.2473722Z compiled=True, 2025-05-07T20:32:25.2473808Z ) 2025-05-07T20:32:25.2474025Z self = 2025-05-07T20:32:25.2474200Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:25.2474204Z 2025-05-07T20:32:25.2474292Z @given( 2025-05-07T20:32:25.2474410Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2474523Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2474648Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2474770Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2474892Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2474970Z ) 2025-05-07T20:32:25.2475266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2475372Z def test_silu_mul_quant( 2025-05-07T20:32:25.2475452Z self, 2025-05-07T20:32:25.2475532Z T: int, 2025-05-07T20:32:25.2475619Z D: int, 2025-05-07T20:32:25.2475719Z scale_ub: Optional[float], 2025-05-07T20:32:25.2475810Z contiguous: bool, 2025-05-07T20:32:25.2475913Z compiled: bool, 2025-05-07T20:32:25.2475993Z ) -> None: 2025-05-07T20:32:25.2476100Z torch.manual_seed(2025) 2025-05-07T20:32:25.2476176Z 2025-05-07T20:32:25.2476346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2476429Z 2025-05-07T20:32:25.2476563Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2476724Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2476823Z x = x_sign * x_clamp 2025-05-07T20:32:25.2476905Z x0 = x[:, :D] 2025-05-07T20:32:25.2477023Z x1 = x[:, D:] 2025-05-07T20:32:25.2477105Z 2025-05-07T20:32:25.2477193Z if contiguous: 2025-05-07T20:32:25.2477285Z x0 = x0.contiguous() 2025-05-07T20:32:25.2477382Z x1 = x1.contiguous() 2025-05-07T20:32:25.2477456Z 2025-05-07T20:32:25.2477556Z if scale_ub is not None: 2025-05-07T20:32:25.2477661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2477797Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2477881Z ) 2025-05-07T20:32:25.2477958Z else: 2025-05-07T20:32:25.2478052Z scale_ub_tensor = None 2025-05-07T20:32:25.2478132Z 2025-05-07T20:32:25.2478261Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2478355Z op = silu_mul_quant 2025-05-07T20:32:25.2478453Z if compiled: 2025-05-07T20:32:25.2478555Z op = torch.compile(op) 2025-05-07T20:32:25.2478661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2478746Z 2025-05-07T20:32:25.2478842Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2478846Z 2025-05-07T20:32:25.2478952Z moe/activation_test.py:117: 2025-05-07T20:32:25.2479084Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2479184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2479292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2479661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2479754Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2480254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2480358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2480722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2480947Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2481285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2481387Z kernel = self.compile( 2025-05-07T20:32:25.2481767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2481940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2482080Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2482084Z 2025-05-07T20:32:25.2482289Z self = 2025-05-07T20:32:25.2483067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2483619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3492340>} 2025-05-07T20:32:25.2484370Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2484560Z context = 2025-05-07T20:32:25.2484565Z 2025-05-07T20:32:25.2484729Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2484999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2485181Z module_map=module_map) 2025-05-07T20:32:25.2485351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2485490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2485572Z E ^ 2025-05-07T20:32:25.2485937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2485942Z 2025-05-07T20:32:25.2486354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2486358Z 2025-05-07T20:32:25.2486464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2486696Z self=, 2025-05-07T20:32:25.2486776Z T=16384, 2025-05-07T20:32:25.2486863Z D=7168, 2025-05-07T20:32:25.2486952Z scale_ub=1200.0, 2025-05-07T20:32:25.2487045Z contiguous=True, 2025-05-07T20:32:25.2487141Z compiled=True, 2025-05-07T20:32:25.2487217Z ) 2025-05-07T20:32:25.2487434Z self = 2025-05-07T20:32:25.2487624Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2487631Z 2025-05-07T20:32:25.2487712Z @given( 2025-05-07T20:32:25.2487831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2487940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2488060Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2488186Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2488301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2488378Z ) 2025-05-07T20:32:25.2488630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2488726Z def test_silu_mul_quant( 2025-05-07T20:32:25.2488808Z self, 2025-05-07T20:32:25.2488897Z T: int, 2025-05-07T20:32:25.2488979Z D: int, 2025-05-07T20:32:25.2489080Z scale_ub: Optional[float], 2025-05-07T20:32:25.2489181Z contiguous: bool, 2025-05-07T20:32:25.2489269Z compiled: bool, 2025-05-07T20:32:25.2489353Z ) -> None: 2025-05-07T20:32:25.2489460Z torch.manual_seed(2025) 2025-05-07T20:32:25.2489540Z 2025-05-07T20:32:25.2489739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2489821Z 2025-05-07T20:32:25.2489933Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2490068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2490159Z x = x_sign * x_clamp 2025-05-07T20:32:25.2490242Z x0 = x[:, :D] 2025-05-07T20:32:25.2490337Z x1 = x[:, D:] 2025-05-07T20:32:25.2490411Z 2025-05-07T20:32:25.2490499Z if contiguous: 2025-05-07T20:32:25.2490601Z x0 = x0.contiguous() 2025-05-07T20:32:25.2490695Z x1 = x1.contiguous() 2025-05-07T20:32:25.2490771Z 2025-05-07T20:32:25.2490871Z if scale_ub is not None: 2025-05-07T20:32:25.2490978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2491124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2491254Z ) 2025-05-07T20:32:25.2491334Z else: 2025-05-07T20:32:25.2491441Z scale_ub_tensor = None 2025-05-07T20:32:25.2491516Z 2025-05-07T20:32:25.2491646Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2491746Z op = silu_mul_quant 2025-05-07T20:32:25.2491834Z if compiled: 2025-05-07T20:32:25.2491936Z op = torch.compile(op) 2025-05-07T20:32:25.2492052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2492128Z 2025-05-07T20:32:25.2492222Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2492226Z 2025-05-07T20:32:25.2492334Z moe/activation_test.py:117: 2025-05-07T20:32:25.2492534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2492682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2492784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2493190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2493291Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2493782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2493880Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2494243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2494466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2494813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2494915Z kernel = self.compile( 2025-05-07T20:32:25.2495297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2495478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2495610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2495614Z 2025-05-07T20:32:25.2495827Z self = 2025-05-07T20:32:25.2496595Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2497093Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca3493c40>} 2025-05-07T20:32:25.2497844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2498042Z context = 2025-05-07T20:32:25.2498046Z 2025-05-07T20:32:25.2498219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2498480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2498587Z module_map=module_map) 2025-05-07T20:32:25.2498754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2498854Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2498940Z E ^ 2025-05-07T20:32:25.2499293Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2499300Z 2025-05-07T20:32:25.2499714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2499719Z 2025-05-07T20:32:25.2499830Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2500098Z self=, 2025-05-07T20:32:25.2500187Z T=16384, 2025-05-07T20:32:25.2500269Z D=5120, 2025-05-07T20:32:25.2500355Z scale_ub=1200.0, 2025-05-07T20:32:25.2500470Z contiguous=True, 2025-05-07T20:32:25.2500564Z compiled=False, 2025-05-07T20:32:25.2500655Z ) 2025-05-07T20:32:25.2500884Z self = 2025-05-07T20:32:25.2501063Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2501068Z 2025-05-07T20:32:25.2501146Z @given( 2025-05-07T20:32:25.2501275Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2501419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2501577Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2501701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2501855Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2501942Z ) 2025-05-07T20:32:25.2502188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2502284Z def test_silu_mul_quant( 2025-05-07T20:32:25.2502369Z self, 2025-05-07T20:32:25.2502448Z T: int, 2025-05-07T20:32:25.2502529Z D: int, 2025-05-07T20:32:25.2502639Z scale_ub: Optional[float], 2025-05-07T20:32:25.2502730Z contiguous: bool, 2025-05-07T20:32:25.2502816Z compiled: bool, 2025-05-07T20:32:25.2502903Z ) -> None: 2025-05-07T20:32:25.2502999Z torch.manual_seed(2025) 2025-05-07T20:32:25.2503073Z 2025-05-07T20:32:25.2503253Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2503335Z 2025-05-07T20:32:25.2503437Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2503564Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2503656Z x = x_sign * x_clamp 2025-05-07T20:32:25.2503753Z x0 = x[:, :D] 2025-05-07T20:32:25.2503840Z x1 = x[:, D:] 2025-05-07T20:32:25.2503914Z 2025-05-07T20:32:25.2504007Z if contiguous: 2025-05-07T20:32:25.2504101Z x0 = x0.contiguous() 2025-05-07T20:32:25.2504192Z x1 = x1.contiguous() 2025-05-07T20:32:25.2504274Z 2025-05-07T20:32:25.2504365Z if scale_ub is not None: 2025-05-07T20:32:25.2504473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2504618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2504694Z ) 2025-05-07T20:32:25.2504779Z else: 2025-05-07T20:32:25.2504875Z scale_ub_tensor = None 2025-05-07T20:32:25.2504952Z 2025-05-07T20:32:25.2505087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2505181Z op = silu_mul_quant 2025-05-07T20:32:25.2505267Z if compiled: 2025-05-07T20:32:25.2505374Z op = torch.compile(op) 2025-05-07T20:32:25.2505484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2505558Z 2025-05-07T20:32:25.2505662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2505666Z 2025-05-07T20:32:25.2505765Z moe/activation_test.py:117: 2025-05-07T20:32:25.2505894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2506002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2506102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2506603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
        _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
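Every fp8e4nv failure in this run has the same root cause: Triton's fp8e4nv type corresponds to torch.float8_e4m3fn, which NVIDIA GPUs implement natively only from compute capability 8.9 (Ada/Hopper) onward. The job's g5.4xlarge runner carries an A10G, which reports compute capability 8.6, so the CUDA backend offers only the fp8e4b15 and fp8e5 encodings named in the message. A minimal sketch of a capability guard that would let the suite skip these examples on such GPUs (the helper name is illustrative, not part of the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; NVIDIA hardware supports it
        # natively starting with compute capability 8.9 (Ada / Hopper).
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on the test method:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")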
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant at moe/activation_test.py:117
     (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5'));
     with compiled=True the traceback additionally passes through torch/_dynamo/eval_frame.py:678 in _fn
     before reaching fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and the Triton compile chain above.

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117
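With verbosity=Verbosity.verbose, Hypothesis logs every drawn example before running it, which is why the listing repeats for each parameter combination. To replay a single failing draw without rerunning the whole search, the op can be called directly; a minimal repro sketch, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    torch.manual_seed(2025)
    T, D = 4096, 7168
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D], x[:, D:]
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    # On a pre-SM-8.9 GPU this raises the CompilationError shown above.
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)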
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
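The requested sizes match the test's input tensor exactly: x = torch.randn([T, 2 * D]) in bfloat16 occupies T * 2D * 2 bytes, and each of sign/abs/clamp materializes another tensor of the same size, so the largest shapes fail as soon as the 22.07 GiB device is nearly full. A quick check of the arithmetic:

    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16),
    # at 2 bytes per bfloat16 element.
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20

    assert x_mib(16384, 5120) == 320.0  # "Tried to allocate 320.00 MiB" (T=16384, D=5120)
    assert x_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB" (T=16384, D=7168)
    assert x_mib(4096, 5120) == 80.0    # "Tried to allocate 80.00 MiB"  (T=4096,  D=5120)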
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 28.44 MiB free
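With only tens of MiB free on a 22.07 GiB device even for 40-56 MiB requests, the pattern suggests allocations surviving across Hypothesis examples rather than any single oversized request. Besides the allocator's own suggestion of exporting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, a per-example cleanup hook is a plausible mitigation; a minimal sketch, assuming one CUDA context is shared by all examples:

    import gc
    import torch

    def free_cuda_between_examples() -> None:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached allocator blocks to the driver
        torch.cuda.synchronize()  # ensure the frees have completed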
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError (type fp8e4nv not supported in this architecture) at moe/activation_test.py:117
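The test only pins down silu_mul_quant's interface: two bfloat16 views x0 and x1 of shape [T, D], an optional float32 scale_ub tensor, and a (y_fp8, y_scale) pair coming back. For reference while reading these traces, here is a hedged eager-mode sketch of that contract; the SiLU-multiply is implied by the op's name, while the rowwise-scaling details are an assumption, not FBGEMM's actual kernel semantics:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute silu(x0) * x1 in fp32, then quantize rowwise to FP8 (e4m3).
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale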
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2637022Z 2025-05-07T20:32:25.2637434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2637445Z 2025-05-07T20:32:25.2637550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2637771Z self=, 2025-05-07T20:32:25.2637899Z T=2048, 2025-05-07T20:32:25.2638017Z D=7168, 2025-05-07T20:32:25.2638102Z scale_ub=1200.0, 2025-05-07T20:32:25.2638195Z contiguous=True, 2025-05-07T20:32:25.2638281Z compiled=False, 2025-05-07T20:32:25.2638354Z ) 2025-05-07T20:32:25.2638619Z self = 2025-05-07T20:32:25.2638800Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2638807Z 2025-05-07T20:32:25.2638924Z @given( 2025-05-07T20:32:25.2639084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2639224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2639391Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2639576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2639754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2639863Z ) 2025-05-07T20:32:25.2640195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2640296Z def test_silu_mul_quant( 2025-05-07T20:32:25.2640378Z self, 2025-05-07T20:32:25.2640455Z T: int, 2025-05-07T20:32:25.2640531Z D: int, 2025-05-07T20:32:25.2640636Z scale_ub: Optional[float], 2025-05-07T20:32:25.2640730Z contiguous: bool, 2025-05-07T20:32:25.2640824Z compiled: bool, 2025-05-07T20:32:25.2640903Z ) -> None: 2025-05-07T20:32:25.2640997Z torch.manual_seed(2025) 2025-05-07T20:32:25.2641073Z 2025-05-07T20:32:25.2641238Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2643003Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2643020Z 2025-05-07T20:32:25.2643141Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2643147Z 2025-05-07T20:32:25.2643248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2643476Z self=, 2025-05-07T20:32:25.2643553Z T=1, 2025-05-07T20:32:25.2643630Z D=5120, 2025-05-07T20:32:25.2643718Z scale_ub=1200.0, 2025-05-07T20:32:25.2643802Z contiguous=True, 2025-05-07T20:32:25.2643894Z compiled=False, 2025-05-07T20:32:25.2643967Z ) 2025-05-07T20:32:25.2644182Z self = 2025-05-07T20:32:25.2644350Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2644357Z 2025-05-07T20:32:25.2644433Z @given( 2025-05-07T20:32:25.2644551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2644656Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2644769Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2644946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2645066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2645141Z ) 2025-05-07T20:32:25.2645389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2645482Z def test_silu_mul_quant( 2025-05-07T20:32:25.2645557Z self, 2025-05-07T20:32:25.2645638Z T: int, 2025-05-07T20:32:25.2645714Z D: int, 2025-05-07T20:32:25.2645812Z scale_ub: Optional[float], 2025-05-07T20:32:25.2645904Z contiguous: bool, 2025-05-07T20:32:25.2645989Z compiled: bool, 2025-05-07T20:32:25.2646069Z ) -> None: 2025-05-07T20:32:25.2646173Z torch.manual_seed(2025) 2025-05-07T20:32:25.2646330Z 2025-05-07T20:32:25.2646494Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2646573Z 2025-05-07T20:32:25.2646663Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2646831Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2646921Z x = x_sign * x_clamp 2025-05-07T20:32:25.2647001Z x0 = x[:, :D] 2025-05-07T20:32:25.2647089Z x1 = x[:, D:] 2025-05-07T20:32:25.2647160Z 2025-05-07T20:32:25.2647242Z if contiguous: 2025-05-07T20:32:25.2647342Z x0 = x0.contiguous() 2025-05-07T20:32:25.2647430Z x1 = x1.contiguous() 2025-05-07T20:32:25.2647501Z 2025-05-07T20:32:25.2647600Z if scale_ub is not None: 2025-05-07T20:32:25.2647706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2647839Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2647921Z ) 2025-05-07T20:32:25.2648001Z else: 2025-05-07T20:32:25.2648098Z scale_ub_tensor = None 2025-05-07T20:32:25.2648176Z 2025-05-07T20:32:25.2648305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2648403Z op = silu_mul_quant 2025-05-07T20:32:25.2648492Z if compiled: 2025-05-07T20:32:25.2648596Z op = torch.compile(op) 2025-05-07T20:32:25.2648707Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2648783Z 2025-05-07T20:32:25.2648877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2648881Z 2025-05-07T20:32:25.2648985Z moe/activation_test.py:117: 2025-05-07T20:32:25.2649113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2649214Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2649319Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2649844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2649974Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2650329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2650552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2650896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2650989Z kernel = self.compile( 2025-05-07T20:32:25.2651377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2651548Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2651676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2651681Z 2025-05-07T20:32:25.2651890Z self = 2025-05-07T20:32:25.2652668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2653253Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca2b639c0>} 2025-05-07T20:32:25.2653999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2654192Z context = 2025-05-07T20:32:25.2654196Z 2025-05-07T20:32:25.2654363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2654623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2654811Z module_map=module_map) 2025-05-07T20:32:25.2654970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2655105Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2655192Z E ^ 2025-05-07T20:32:25.2655544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2655549Z 2025-05-07T20:32:25.2655959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2655969Z 2025-05-07T20:32:25.2656074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2656294Z self=, 2025-05-07T20:32:25.2656376Z T=2048, 2025-05-07T20:32:25.2656451Z D=5120, 2025-05-07T20:32:25.2656532Z scale_ub=None, 2025-05-07T20:32:25.2656629Z contiguous=True, 2025-05-07T20:32:25.2656713Z compiled=False, 2025-05-07T20:32:25.2656784Z ) 2025-05-07T20:32:25.2657008Z self = 2025-05-07T20:32:25.2657184Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2657191Z 2025-05-07T20:32:25.2657272Z @given( 2025-05-07T20:32:25.2657395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2657500Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2657614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2657730Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2657847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2657924Z ) 2025-05-07T20:32:25.2658165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2658266Z def test_silu_mul_quant( 2025-05-07T20:32:25.2658343Z self, 2025-05-07T20:32:25.2658421Z T: int, 2025-05-07T20:32:25.2658507Z D: int, 2025-05-07T20:32:25.2658604Z scale_ub: Optional[float], 2025-05-07T20:32:25.2658693Z contiguous: bool, 2025-05-07T20:32:25.2658786Z compiled: bool, 2025-05-07T20:32:25.2658866Z ) -> None: 2025-05-07T20:32:25.2658969Z torch.manual_seed(2025) 2025-05-07T20:32:25.2659042Z 2025-05-07T20:32:25.2659206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2659287Z 2025-05-07T20:32:25.2659377Z > x_sign = torch.sign(x) 2025-05-07T20:32:25.2661141Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2661161Z 2025-05-07T20:32:25.2661279Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:25.2661287Z 2025-05-07T20:32:25.2661431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2661658Z self=, 2025-05-07T20:32:25.2661737Z T=16384, 2025-05-07T20:32:25.2661814Z D=5120, 2025-05-07T20:32:25.2661904Z scale_ub=None, 2025-05-07T20:32:25.2661988Z contiguous=True, 2025-05-07T20:32:25.2662078Z compiled=False, 2025-05-07T20:32:25.2662150Z ) 2025-05-07T20:32:25.2662365Z self = 2025-05-07T20:32:25.2662546Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2662550Z 2025-05-07T20:32:25.2662628Z @given( 2025-05-07T20:32:25.2662785Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2662931Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2663043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2663197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2663318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2663391Z ) 2025-05-07T20:32:25.2663642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2663735Z def test_silu_mul_quant( 2025-05-07T20:32:25.2663810Z self, 2025-05-07T20:32:25.2663890Z T: int, 2025-05-07T20:32:25.2663965Z D: int, 2025-05-07T20:32:25.2664059Z scale_ub: Optional[float], 2025-05-07T20:32:25.2664154Z contiguous: bool, 2025-05-07T20:32:25.2664238Z compiled: bool, 2025-05-07T20:32:25.2664315Z ) -> None: 2025-05-07T20:32:25.2664416Z torch.manual_seed(2025) 2025-05-07T20:32:25.2664491Z 2025-05-07T20:32:25.2664658Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2666431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2666437Z 2025-05-07T20:32:25.2666553Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2666564Z 2025-05-07T20:32:25.2666664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2666881Z self=, 2025-05-07T20:32:25.2666967Z T=4096, 2025-05-07T20:32:25.2667044Z D=5120, 2025-05-07T20:32:25.2667125Z scale_ub=None, 2025-05-07T20:32:25.2667216Z contiguous=True, 2025-05-07T20:32:25.2667298Z compiled=False, 2025-05-07T20:32:25.2667370Z ) 2025-05-07T20:32:25.2667595Z self = 2025-05-07T20:32:25.2667763Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2667768Z 2025-05-07T20:32:25.2667851Z @given( 2025-05-07T20:32:25.2667967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2668063Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2668182Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2668296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2668407Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2668488Z ) 2025-05-07T20:32:25.2668730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2668827Z def test_silu_mul_quant( 2025-05-07T20:32:25.2668910Z self, 2025-05-07T20:32:25.2668986Z T: int, 2025-05-07T20:32:25.2669123Z D: int, 2025-05-07T20:32:25.2669231Z scale_ub: Optional[float], 2025-05-07T20:32:25.2669367Z contiguous: bool, 2025-05-07T20:32:25.2669460Z compiled: bool, 2025-05-07T20:32:25.2669539Z ) -> None: 2025-05-07T20:32:25.2669632Z torch.manual_seed(2025) 2025-05-07T20:32:25.2669710Z 2025-05-07T20:32:25.2669874Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2671674Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
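On the recurring CompilationError: fp8e4nv is Triton's name for the e4m3 FP8 format, which Triton lowers only on GPUs of compute capability 8.9 and newer; on this runner's older card only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A capability guard for such tests could look like the following sketch (the helper and decorator names are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton supports fp8e4nv (e4m3) only on compute capability 8.9+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for skipping FP8 kernel tests on older GPUs.
    requires_fp8e4nv = unittest.skipUnless(
        _supports_fp8e4nv(), "fp8e4nv needs an sm_89+ GPU"
    )
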
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2671804Z 2025-05-07T20:32:25.2671924Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2671928Z 2025-05-07T20:32:25.2672029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2672254Z self=, 2025-05-07T20:32:25.2672330Z T=2048, 2025-05-07T20:32:25.2672407Z D=5120, 2025-05-07T20:32:25.2672494Z scale_ub=None, 2025-05-07T20:32:25.2672582Z contiguous=False, 2025-05-07T20:32:25.2672671Z compiled=False, 2025-05-07T20:32:25.2672742Z ) 2025-05-07T20:32:25.2672956Z self = 2025-05-07T20:32:25.2673132Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2673139Z 2025-05-07T20:32:25.2673219Z @given( 2025-05-07T20:32:25.2673333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2673441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2673559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2673675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2673794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2673869Z ) 2025-05-07T20:32:25.2674117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2674211Z def test_silu_mul_quant( 2025-05-07T20:32:25.2674287Z self, 2025-05-07T20:32:25.2674370Z T: int, 2025-05-07T20:32:25.2674445Z D: int, 2025-05-07T20:32:25.2674542Z scale_ub: Optional[float], 2025-05-07T20:32:25.2674642Z contiguous: bool, 2025-05-07T20:32:25.2674728Z compiled: bool, 2025-05-07T20:32:25.2674808Z ) -> None: 2025-05-07T20:32:25.2674916Z torch.manual_seed(2025) 2025-05-07T20:32:25.2674992Z 2025-05-07T20:32:25.2675157Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2676920Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2676926Z 2025-05-07T20:32:25.2677042Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2677053Z 2025-05-07T20:32:25.2677154Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2677375Z self=, 2025-05-07T20:32:25.2677464Z T=4096, 2025-05-07T20:32:25.2677541Z D=7168, 2025-05-07T20:32:25.2677624Z scale_ub=None, 2025-05-07T20:32:25.2677715Z contiguous=True, 2025-05-07T20:32:25.2677799Z compiled=True, 2025-05-07T20:32:25.2677919Z ) 2025-05-07T20:32:25.2678142Z self = 2025-05-07T20:32:25.2678311Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.2678316Z 2025-05-07T20:32:25.2678399Z @given( 2025-05-07T20:32:25.2678514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2678611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2678731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2678850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2678962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2679082Z ) 2025-05-07T20:32:25.2679361Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2679453Z def test_silu_mul_quant( 2025-05-07T20:32:25.2679538Z self, 2025-05-07T20:32:25.2679696Z T: int, 2025-05-07T20:32:25.2679789Z D: int, 2025-05-07T20:32:25.2679904Z scale_ub: Optional[float], 2025-05-07T20:32:25.2680011Z contiguous: bool, 2025-05-07T20:32:25.2680106Z compiled: bool, 2025-05-07T20:32:25.2680182Z ) -> None: 2025-05-07T20:32:25.2680275Z torch.manual_seed(2025) 2025-05-07T20:32:25.2680355Z 2025-05-07T20:32:25.2680518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2682277Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2682296Z 2025-05-07T20:32:25.2682413Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2682417Z 2025-05-07T20:32:25.2682517Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2682740Z self=, 2025-05-07T20:32:25.2682816Z T=2048, 2025-05-07T20:32:25.2682893Z D=5120, 2025-05-07T20:32:25.2682980Z scale_ub=1200.0, 2025-05-07T20:32:25.2683069Z contiguous=False, 2025-05-07T20:32:25.2683157Z compiled=False, 2025-05-07T20:32:25.2683230Z ) 2025-05-07T20:32:25.2683445Z self = 2025-05-07T20:32:25.2683627Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2683635Z 2025-05-07T20:32:25.2683711Z @given( 2025-05-07T20:32:25.2683828Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2683935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2684050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2684166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2684282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2684355Z ) 2025-05-07T20:32:25.2684602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2684695Z def test_silu_mul_quant( 2025-05-07T20:32:25.2684771Z self, 2025-05-07T20:32:25.2684852Z T: int, 2025-05-07T20:32:25.2684929Z D: int, 2025-05-07T20:32:25.2685026Z scale_ub: Optional[float], 2025-05-07T20:32:25.2685123Z contiguous: bool, 2025-05-07T20:32:25.2685209Z compiled: bool, 2025-05-07T20:32:25.2685292Z ) -> None: 2025-05-07T20:32:25.2685397Z torch.manual_seed(2025) 2025-05-07T20:32:25.2685468Z 2025-05-07T20:32:25.2685632Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2687433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2687440Z 2025-05-07T20:32:25.2687555Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2687604Z 2025-05-07T20:32:25.2687706Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2687961Z self=, 2025-05-07T20:32:25.2688043Z T=4096, 2025-05-07T20:32:25.2688120Z D=7168, 2025-05-07T20:32:25.2688239Z scale_ub=1200.0, 2025-05-07T20:32:25.2688333Z contiguous=True, 2025-05-07T20:32:25.2688415Z compiled=False, 2025-05-07T20:32:25.2688486Z ) 2025-05-07T20:32:25.2688707Z self = 2025-05-07T20:32:25.2688878Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2688882Z 2025-05-07T20:32:25.2688966Z @given( 2025-05-07T20:32:25.2689081Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2689178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2689298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2689414Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2689528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2689611Z ) 2025-05-07T20:32:25.2689855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2689951Z def test_silu_mul_quant( 2025-05-07T20:32:25.2690033Z self, 2025-05-07T20:32:25.2690114Z T: int, 2025-05-07T20:32:25.2690191Z D: int, 2025-05-07T20:32:25.2690294Z scale_ub: Optional[float], 2025-05-07T20:32:25.2690384Z contiguous: bool, 2025-05-07T20:32:25.2690474Z compiled: bool, 2025-05-07T20:32:25.2690553Z ) -> None: 2025-05-07T20:32:25.2690647Z torch.manual_seed(2025) 2025-05-07T20:32:25.2690726Z 2025-05-07T20:32:25.2690888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2692643Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2692659Z 2025-05-07T20:32:25.2692774Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2692779Z 2025-05-07T20:32:25.2692879Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2693104Z self=, 2025-05-07T20:32:25.2693183Z T=16384, 2025-05-07T20:32:25.2693262Z D=7168, 2025-05-07T20:32:25.2693352Z scale_ub=None, 2025-05-07T20:32:25.2693438Z contiguous=False, 2025-05-07T20:32:25.2693527Z compiled=True, 2025-05-07T20:32:25.2693600Z ) 2025-05-07T20:32:25.2693814Z self = 2025-05-07T20:32:25.2693998Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:25.2694002Z 2025-05-07T20:32:25.2694081Z @given( 2025-05-07T20:32:25.2694198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2694347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2694464Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2694577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2694694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2694766Z ) 2025-05-07T20:32:25.2695012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2695107Z def test_silu_mul_quant( 2025-05-07T20:32:25.2695182Z self, 2025-05-07T20:32:25.2695264Z T: int, 2025-05-07T20:32:25.2695342Z D: int, 2025-05-07T20:32:25.2695439Z scale_ub: Optional[float], 2025-05-07T20:32:25.2695617Z contiguous: bool, 2025-05-07T20:32:25.2695702Z compiled: bool, 2025-05-07T20:32:25.2695779Z ) -> None: 2025-05-07T20:32:25.2695881Z torch.manual_seed(2025) 2025-05-07T20:32:25.2695953Z 2025-05-07T20:32:25.2696167Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2697925Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
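The requested sizes line up exactly with the bf16 input x of shape [T, 2*D]: at 2 bytes per element that tensor needs T*2*D*2 bytes, so T=16384, D=7168 gives 16384*14336*2 bytes = 448 MiB, matching the allocation above. A quick check (the helper name is ours):

    def bf16_mib(T: int, D: int) -> float:
        # Size of a [T, 2*D] bfloat16 tensor in MiB (2 bytes per element).
        return T * 2 * D * 2 / 2**20

    assert bf16_mib(2048, 5120) == 40.0     # the 40.00 MiB requests above
    assert bf16_mib(16384, 5120) == 320.0   # the 320.00 MiB request
    assert bf16_mib(4096, 7168) == 112.0    # the 112.00 MiB requests
    assert bf16_mib(16384, 7168) == 448.0   # the 448.00 MiB requests
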
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2697931Z 2025-05-07T20:32:25.2698056Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2698063Z 2025-05-07T20:32:25.2698164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2698381Z self=, 2025-05-07T20:32:25.2698464Z T=4096, 2025-05-07T20:32:25.2698542Z D=7168, 2025-05-07T20:32:25.2698623Z scale_ub=None, 2025-05-07T20:32:25.2698713Z contiguous=True, 2025-05-07T20:32:25.2698795Z compiled=False, 2025-05-07T20:32:25.2698866Z ) 2025-05-07T20:32:25.2699086Z self = 2025-05-07T20:32:25.2699253Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2699258Z 2025-05-07T20:32:25.2699342Z @given( 2025-05-07T20:32:25.2699460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2699572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2699704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2699839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2699954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2700033Z ) 2025-05-07T20:32:25.2700279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2700375Z def test_silu_mul_quant( 2025-05-07T20:32:25.2700458Z self, 2025-05-07T20:32:25.2700534Z T: int, 2025-05-07T20:32:25.2700609Z D: int, 2025-05-07T20:32:25.2700713Z scale_ub: Optional[float], 2025-05-07T20:32:25.2700800Z contiguous: bool, 2025-05-07T20:32:25.2700892Z compiled: bool, 2025-05-07T20:32:25.2700969Z ) -> None: 2025-05-07T20:32:25.2701062Z torch.manual_seed(2025) 2025-05-07T20:32:25.2701139Z 2025-05-07T20:32:25.2701303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2703099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2703119Z 2025-05-07T20:32:25.2703235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2703240Z 2025-05-07T20:32:25.2703341Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2703565Z self=, 2025-05-07T20:32:25.2703643Z T=16384, 2025-05-07T20:32:25.2703722Z D=7168, 2025-05-07T20:32:25.2703808Z scale_ub=None, 2025-05-07T20:32:25.2703894Z contiguous=True, 2025-05-07T20:32:25.2703987Z compiled=False, 2025-05-07T20:32:25.2704100Z ) 2025-05-07T20:32:25.2704354Z self = 2025-05-07T20:32:25.2704533Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:25.2704537Z 2025-05-07T20:32:25.2704653Z @given( 2025-05-07T20:32:25.2704773Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2704877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2704989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2705104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2705223Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2705296Z ) 2025-05-07T20:32:25.2705545Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2705639Z def test_silu_mul_quant( 2025-05-07T20:32:25.2705715Z self, 2025-05-07T20:32:25.2705797Z T: int, 2025-05-07T20:32:25.2705877Z D: int, 2025-05-07T20:32:25.2705977Z scale_ub: Optional[float], 2025-05-07T20:32:25.2706072Z contiguous: bool, 2025-05-07T20:32:25.2706159Z compiled: bool, 2025-05-07T20:32:25.2706236Z ) -> None: 2025-05-07T20:32:25.2706341Z torch.manual_seed(2025) 2025-05-07T20:32:25.2706419Z 2025-05-07T20:32:25.2706582Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2708343Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2708354Z 2025-05-07T20:32:25.2708476Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2708480Z 2025-05-07T20:32:25.2708581Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2708802Z self=, 2025-05-07T20:32:25.2708888Z T=16384, 2025-05-07T20:32:25.2708965Z D=7168, 2025-05-07T20:32:25.2709049Z scale_ub=1200.0, 2025-05-07T20:32:25.2709192Z contiguous=True, 2025-05-07T20:32:25.2709274Z compiled=False, 2025-05-07T20:32:25.2709347Z ) 2025-05-07T20:32:25.2709567Z self = 2025-05-07T20:32:25.2709742Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2709747Z 2025-05-07T20:32:25.2709829Z @given( 2025-05-07T20:32:25.2709966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2710073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2710214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2710331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2710440Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2710521Z ) 2025-05-07T20:32:25.2710809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2710905Z def test_silu_mul_quant( 2025-05-07T20:32:25.2710988Z self, 2025-05-07T20:32:25.2711063Z T: int, 2025-05-07T20:32:25.2711138Z D: int, 2025-05-07T20:32:25.2711247Z scale_ub: Optional[float], 2025-05-07T20:32:25.2711337Z contiguous: bool, 2025-05-07T20:32:25.2711429Z compiled: bool, 2025-05-07T20:32:25.2711506Z ) -> None: 2025-05-07T20:32:25.2711599Z torch.manual_seed(2025) 2025-05-07T20:32:25.2711677Z 2025-05-07T20:32:25.2711841Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2713695Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2713743Z 2025-05-07T20:32:25.2713860Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2713865Z 2025-05-07T20:32:25.2713965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2714189Z self=, 2025-05-07T20:32:25.2714266Z T=128, 2025-05-07T20:32:25.2714342Z D=5120, 2025-05-07T20:32:25.2714429Z scale_ub=1200.0, 2025-05-07T20:32:25.2714518Z contiguous=False, 2025-05-07T20:32:25.2714609Z compiled=False, 2025-05-07T20:32:25.2714681Z ) 2025-05-07T20:32:25.2714895Z self = 2025-05-07T20:32:25.2715073Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:25.2715077Z 2025-05-07T20:32:25.2715157Z @given( 2025-05-07T20:32:25.2715272Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2715377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2715488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2715605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2715722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2715795Z ) 2025-05-07T20:32:25.2716044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2716136Z def test_silu_mul_quant( 2025-05-07T20:32:25.2716215Z self, 2025-05-07T20:32:25.2716299Z T: int, 2025-05-07T20:32:25.2716374Z D: int, 2025-05-07T20:32:25.2716471Z scale_ub: Optional[float], 2025-05-07T20:32:25.2716563Z contiguous: bool, 2025-05-07T20:32:25.2716648Z compiled: bool, 2025-05-07T20:32:25.2716726Z ) -> None: 2025-05-07T20:32:25.2716829Z torch.manual_seed(2025) 2025-05-07T20:32:25.2716902Z 2025-05-07T20:32:25.2717067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2717149Z 2025-05-07T20:32:25.2717240Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2717370Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2717458Z x = x_sign * x_clamp 2025-05-07T20:32:25.2717541Z x0 = x[:, :D] 2025-05-07T20:32:25.2717626Z x1 = x[:, D:] 2025-05-07T20:32:25.2717698Z 2025-05-07T20:32:25.2717782Z if contiguous: 2025-05-07T20:32:25.2717882Z x0 = x0.contiguous() 2025-05-07T20:32:25.2717973Z x1 = x1.contiguous() 2025-05-07T20:32:25.2718047Z 2025-05-07T20:32:25.2718152Z if scale_ub is not None: 2025-05-07T20:32:25.2722534Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2722698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2722849Z ) 2025-05-07T20:32:25.2722935Z else: 2025-05-07T20:32:25.2723033Z scale_ub_tensor = None 2025-05-07T20:32:25.2723107Z 2025-05-07T20:32:25.2723250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2723344Z op = silu_mul_quant 2025-05-07T20:32:25.2723435Z if compiled: 2025-05-07T20:32:25.2723544Z op = torch.compile(op) 2025-05-07T20:32:25.2723651Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2723725Z 2025-05-07T20:32:25.2723824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2723829Z 2025-05-07T20:32:25.2723929Z moe/activation_test.py:117: 2025-05-07T20:32:25.2724114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2724255Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2724357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2724909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2725008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2725365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2725597Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2725938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2726040Z kernel = self.compile( 2025-05-07T20:32:25.2726423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2726607Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2726749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2726754Z 2025-05-07T20:32:25.2726964Z self = 2025-05-07T20:32:25.2727748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2728697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290a5c0>} 2025-05-07T20:32:25.2729728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2729936Z context = 2025-05-07T20:32:25.2729942Z 2025-05-07T20:32:25.2730111Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2730384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2730493Z module_map=module_map) 2025-05-07T20:32:25.2730655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2730762Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2730842Z E ^ 2025-05-07T20:32:25.2731207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2731212Z 2025-05-07T20:32:25.2731623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2731630Z 2025-05-07T20:32:25.2731737Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2731967Z self=, 2025-05-07T20:32:25.2732047Z T=2048, 2025-05-07T20:32:25.2732134Z D=7168, 2025-05-07T20:32:25.2732394Z scale_ub=None, 2025-05-07T20:32:25.2732484Z contiguous=False, 2025-05-07T20:32:25.2732579Z compiled=False, 2025-05-07T20:32:25.2732657Z ) 2025-05-07T20:32:25.2732876Z self = 2025-05-07T20:32:25.2733055Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:25.2733060Z 2025-05-07T20:32:25.2733138Z @given( 2025-05-07T20:32:25.2733255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2733361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2733477Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2733601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2733913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2733988Z ) 2025-05-07T20:32:25.2734243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2734400Z def test_silu_mul_quant( 2025-05-07T20:32:25.2734482Z self, 2025-05-07T20:32:25.2734569Z T: int, 2025-05-07T20:32:25.2734650Z D: int, 2025-05-07T20:32:25.2734748Z scale_ub: Optional[float], 2025-05-07T20:32:25.2734846Z contiguous: bool, 2025-05-07T20:32:25.2734934Z compiled: bool, 2025-05-07T20:32:25.2735018Z ) -> None: 2025-05-07T20:32:25.2735119Z torch.manual_seed(2025) 2025-05-07T20:32:25.2735196Z 2025-05-07T20:32:25.2735366Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2737149Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
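For orientation, the op under test fuses a SiLU-gated multiply with rowwise FP8 quantization: y = x0 * sigmoid(x0) * x1, after which each row is rescaled into the FP8 range (optionally capped by scale_ub) and cast down, returning the quantized rows plus per-row scales, which is why the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch sketch of the quantization step, assuming e4m3 with max normal value 448.0 and an eps guard of our choosing (triton_quantize_fp8_row's exact semantics may differ):

    from typing import Optional, Tuple

    import torch

    FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally capped by the scale upper bound.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Per-row dequantization scale; eps guard avoids division by zero.
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale
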
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2737160Z 2025-05-07T20:32:25.2737286Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2737290Z 2025-05-07T20:32:25.2737394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2737614Z self=, 2025-05-07T20:32:25.2737700Z T=128, 2025-05-07T20:32:25.2737777Z D=7168, 2025-05-07T20:32:25.2737862Z scale_ub=1200.0, 2025-05-07T20:32:25.2737956Z contiguous=True, 2025-05-07T20:32:25.2738042Z compiled=True, 2025-05-07T20:32:25.2738120Z ) 2025-05-07T20:32:25.2738344Z self = 2025-05-07T20:32:25.2738513Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2738517Z 2025-05-07T20:32:25.2738607Z @given( 2025-05-07T20:32:25.2738728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2738828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2738949Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2739066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2739179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2739261Z ) 2025-05-07T20:32:25.2739529Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2739642Z def test_silu_mul_quant( 2025-05-07T20:32:25.2739733Z self, 2025-05-07T20:32:25.2739813Z T: int, 2025-05-07T20:32:25.2739897Z D: int, 2025-05-07T20:32:25.2739997Z scale_ub: Optional[float], 2025-05-07T20:32:25.2740111Z contiguous: bool, 2025-05-07T20:32:25.2740245Z compiled: bool, 2025-05-07T20:32:25.2740358Z ) -> None: 2025-05-07T20:32:25.2740490Z torch.manual_seed(2025) 2025-05-07T20:32:25.2740614Z 2025-05-07T20:32:25.2741108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2741603Z 2025-05-07T20:32:25.2741883Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2742267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2742685Z x = x_sign * x_clamp 2025-05-07T20:32:25.2743041Z x0 = x[:, :D] 2025-05-07T20:32:25.2743366Z x1 = x[:, D:] 2025-05-07T20:32:25.2743699Z 2025-05-07T20:32:25.2743992Z if contiguous: 2025-05-07T20:32:25.2744332Z x0 = x0.contiguous() 2025-05-07T20:32:25.2744749Z x1 = x1.contiguous() 2025-05-07T20:32:25.2745136Z 2025-05-07T20:32:25.2745442Z if scale_ub is not None: 2025-05-07T20:32:25.2745998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.2746541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.2747026Z ) 2025-05-07T20:32:25.2747320Z else: 2025-05-07T20:32:25.2747784Z scale_ub_tensor = None 2025-05-07T20:32:25.2748165Z 2025-05-07T20:32:25.2748519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.2748969Z op = silu_mul_quant 2025-05-07T20:32:25.2749397Z if compiled: 2025-05-07T20:32:25.2749745Z op = torch.compile(op) 2025-05-07T20:32:25.2750051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2750332Z 2025-05-07T20:32:25.2750528Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.2750692Z 2025-05-07T20:32:25.2750801Z moe/activation_test.py:117: 2025-05-07T20:32:25.2751091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2751429Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.2751714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.2752266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:25.2752833Z return fn(*args, **kwargs) 
2025-05-07T20:32:25.2753493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.2754178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.2754704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.2755381Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.2756046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.2756570Z kernel = self.compile( 2025-05-07T20:32:25.2757113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.2757772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.2758172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.2758401Z 2025-05-07T20:32:25.2758608Z self = 2025-05-07T20:32:25.2759686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.2761069Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7efca290aac0>} 2025-05-07T20:32:25.2762404Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.2763427Z context = 2025-05-07T20:32:25.2763714Z 2025-05-07T20:32:25.2763955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.2764471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.2764941Z module_map=module_map) 2025-05-07T20:32:25.2765295Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.2765648Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.2765907Z E ^ 2025-05-07T20:32:25.2766365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.2766817Z 2025-05-07T20:32:25.2767235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.2767871Z 2025-05-07T20:32:25.2767977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2768392Z self=, 2025-05-07T20:32:25.2768832Z T=128, 2025-05-07T20:32:25.2769028Z D=7168, 2025-05-07T20:32:25.2769228Z scale_ub=1200.0, 2025-05-07T20:32:25.2769447Z contiguous=True, 2025-05-07T20:32:25.2769671Z compiled=False, 2025-05-07T20:32:25.2769876Z ) 2025-05-07T20:32:25.2770195Z self = 2025-05-07T20:32:25.2770677Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.2770950Z 2025-05-07T20:32:25.2771029Z @given( 2025-05-07T20:32:25.2771262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2771567Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2771875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2772205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2772522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2772801Z ) 2025-05-07T20:32:25.2773154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2773596Z def test_silu_mul_quant( 2025-05-07T20:32:25.2773828Z self, 2025-05-07T20:32:25.2774027Z T: int, 2025-05-07T20:32:25.2774222Z D: int, 2025-05-07T20:32:25.2774433Z scale_ub: Optional[float], 2025-05-07T20:32:25.2774699Z contiguous: bool, 2025-05-07T20:32:25.2774936Z compiled: bool, 2025-05-07T20:32:25.2775149Z ) -> None: 2025-05-07T20:32:25.2775366Z torch.manual_seed(2025) 2025-05-07T20:32:25.2775605Z 2025-05-07T20:32:25.2775870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2776207Z 2025-05-07T20:32:25.2776400Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2776685Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2778679Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
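The compiled=True examples fail the same way as the eager ones: the traceback merely gains a torch/_dynamo/eval_frame.py frame before landing in the identical _fbgemm_silu_mul_quant launch, because torch.compile falls through to the same Triton kernel. In miniature (an illustrative wrapper, not the test's exact code):

    import torch

    def dispatch(op, *args, compiled: bool = False):
        # With compiled=True, dynamo wraps op but ultimately invokes the same
        # underlying kernel, so kernel-side errors reproduce in both modes.
        if compiled:
            op = torch.compile(op)
        return op(*args)
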
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2780535Z 2025-05-07T20:32:25.2780654Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.2780869Z 2025-05-07T20:32:25.2780974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2781378Z self=, 2025-05-07T20:32:25.2781773Z T=128, 2025-05-07T20:32:25.2781963Z D=5120, 2025-05-07T20:32:25.2782154Z scale_ub=1200.0, 2025-05-07T20:32:25.2782368Z contiguous=True, 2025-05-07T20:32:25.2782590Z compiled=True, 2025-05-07T20:32:25.2782793Z ) 2025-05-07T20:32:25.2783161Z self = 2025-05-07T20:32:25.2783645Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.2783916Z 2025-05-07T20:32:25.2783995Z @given( 2025-05-07T20:32:25.2784225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2784529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2784837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2785162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2785481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2785769Z ) 2025-05-07T20:32:25.2786115Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2786630Z def test_silu_mul_quant( 2025-05-07T20:32:25.2786869Z self, 2025-05-07T20:32:25.2787065Z T: int, 2025-05-07T20:32:25.2787257Z D: int, 2025-05-07T20:32:25.2787516Z scale_ub: Optional[float], 2025-05-07T20:32:25.2787791Z contiguous: bool, 2025-05-07T20:32:25.2788023Z compiled: bool, 2025-05-07T20:32:25.2788244Z ) -> None: 2025-05-07T20:32:25.2788458Z torch.manual_seed(2025) 2025-05-07T20:32:25.2788695Z 2025-05-07T20:32:25.2788959Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2789371Z 2025-05-07T20:32:25.2789565Z x_sign = torch.sign(x) 2025-05-07T20:32:25.2789845Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.2791829Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
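The sweep itself is driven by Hypothesis: @given draws (T, D, scale_ub, contiguous, compiled) from sampled_from strategies, and the session banner below reports a derandomized 'ci' profile, which keeps example order stable across the --lf rerun. Such a profile is registered once, along the lines of this sketch (the suite's actual conftest may differ):

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,      # no example database on CI runners
        derandomize=True,   # stable example order, reproducible reruns
        deadline=None,      # first call may pay kernel-compilation cost
        print_blob=True,    # emit a reproduction blob on failure
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")
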
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2793694Z 2025-05-07T20:32:25.2793812Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:25.2794024Z 2025-05-07T20:32:25.2794134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.2794538Z self=, 2025-05-07T20:32:25.2794949Z T=128, 2025-05-07T20:32:25.2795133Z D=7168, 2025-05-07T20:32:25.2795325Z scale_ub=None, 2025-05-07T20:32:25.2795543Z contiguous=True, 2025-05-07T20:32:25.2795760Z compiled=True, 2025-05-07T20:32:25.2795962Z ) 2025-05-07T20:32:25.2796281Z self = 2025-05-07T20:32:25.2796761Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:25.2797029Z 2025-05-07T20:32:25.2797107Z @given( 2025-05-07T20:32:25.2797344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.2797654Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.2797964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.2798293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.2798619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.2798894Z ) 2025-05-07T20:32:25.2799242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.2799680Z def test_silu_mul_quant( 2025-05-07T20:32:25.2799915Z self, 2025-05-07T20:32:25.2800110Z T: int, 2025-05-07T20:32:25.2800302Z D: int, 2025-05-07T20:32:25.2800509Z scale_ub: Optional[float], 2025-05-07T20:32:25.2800778Z contiguous: bool, 2025-05-07T20:32:25.2801018Z compiled: bool, 2025-05-07T20:32:25.2801232Z ) -> None: 2025-05-07T20:32:25.2801449Z torch.manual_seed(2025) 2025-05-07T20:32:25.2801686Z 2025-05-07T20:32:25.2802006Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.2804028Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:25.2805890Z 2025-05-07T20:32:25.2806009Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:25.2806436Z =============================== warnings summary =============================== 2025-05-07T20:32:25.2806975Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2807706Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2808405Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:25.2809722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:25.2810913Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:25.2811242Z 2025-05-07T20:32:25.2811450Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:25.2811929Z ================= 1 failed, 1 deselected, 3 warnings in 13.85s ================= 2025-05-07T20:32:26.8576161Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:26.9198054Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:26.9198320Z 2025-05-07T20:32:28.9214086Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:31.0620323Z ============================= test session starts ============================== 2025-05-07T20:32:31.0622033Z platform linux -- Python 3.11.8, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:31.0623053Z cachedir: .pytest_cache 2025-05-07T20:32:31.0623729Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:31.0624458Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:31.0624862Z plugins: hypothesis-6.131.14 2025-05-07T20:32:32.6777003Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:32.8304038Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:32.8304609Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:32.8304907Z 2025-05-07T20:32:35.1971696Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.1973265Z self=, 2025-05-07T20:32:35.1973716Z T=1, 2025-05-07T20:32:35.1973901Z D=5120, 2025-05-07T20:32:35.1974097Z scale_ub=None, 2025-05-07T20:32:35.1974341Z contiguous=True, 2025-05-07T20:32:35.1974567Z compiled=True, 2025-05-07T20:32:35.1974778Z ) 2025-05-07T20:32:35.1975103Z self = 2025-05-07T20:32:35.1975587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.1976150Z 2025-05-07T20:32:35.1976234Z @given( 2025-05-07T20:32:35.1976474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.1976781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.1977089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.1977419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.1977746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.1978026Z ) 2025-05-07T20:32:35.1978377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.1978816Z def test_silu_mul_quant( 2025-05-07T20:32:35.1979146Z self, 2025-05-07T20:32:35.1979431Z T: int, 2025-05-07T20:32:35.1979638Z D: int, 2025-05-07T20:32:35.1979857Z scale_ub: Optional[float], 2025-05-07T20:32:35.1980130Z contiguous: bool, 2025-05-07T20:32:35.1980373Z compiled: bool, 2025-05-07T20:32:35.1980675Z ) -> None: 2025-05-07T20:32:35.1980903Z torch.manual_seed(2025) 2025-05-07T20:32:35.1981151Z 2025-05-07T20:32:35.1981421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.1981763Z 2025-05-07T20:32:35.1981963Z x_sign = torch.sign(x) 2025-05-07T20:32:35.1982250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:35.1982564Z x = x_sign * x_clamp 2025-05-07T20:32:35.1982808Z x0 = x[:, :D] 2025-05-07T20:32:35.1983027Z x1 = x[:, D:] 2025-05-07T20:32:35.1983231Z 2025-05-07T20:32:35.1983431Z if contiguous: 2025-05-07T20:32:35.1983708Z x0 = x0.contiguous() 2025-05-07T20:32:35.1983969Z x1 = x1.contiguous() 2025-05-07T20:32:35.1984215Z 2025-05-07T20:32:35.1984411Z if scale_ub is not None: 2025-05-07T20:32:35.1984680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.1985021Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.1985334Z ) 2025-05-07T20:32:35.1985525Z else: 2025-05-07T20:32:35.1985737Z scale_ub_tensor = None 2025-05-07T20:32:35.1985991Z 2025-05-07T20:32:35.1986223Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.1986537Z op = silu_mul_quant 2025-05-07T20:32:35.1986791Z if compiled: 2025-05-07T20:32:35.1987039Z op = torch.compile(op) 2025-05-07T20:32:35.1987338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.1987613Z 2025-05-07T20:32:35.1987808Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.1988086Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.1988382Z 2025-05-07T20:32:35.1988625Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.1988951Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.1989329Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.1989652Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.1990004Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.1990316Z 2025-05-07T20:32:35.1990521Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.1990715Z 2025-05-07T20:32:35.1990816Z moe/activation_test.py:126: 2025-05-07T20:32:35.1991113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.1991451Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.1991780Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.1992565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.1993379Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.1993924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.1994657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.1995340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.1996068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.1996818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:35.1997556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.1998287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.1999011Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.1999612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2000173Z fn() 2025-05-07T20:32:35.2000693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2001281Z self.fn.run( 
2025-05-07T20:32:35.2001741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.2002272Z     kernel = self.compile(
2025-05-07T20:32:35.2002822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.2003522Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.2003923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.2004167Z 
2025-05-07T20:32:35.2004374Z self = <...>
2025-05-07T20:32:35.2005458Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.2006836Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9831ba1260>}
2025-05-07T20:32:35.2008171Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:35.2009179Z context = <...>
2025-05-07T20:32:35.2009467Z 
2025-05-07T20:32:35.2009637Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.2010158Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.2010618Z             module_map=module_map)
2025-05-07T20:32:35.2010985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.2011343Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:35.2011613Z E       ^
2025-05-07T20:32:35.2012074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.2012527Z 
2025-05-07T20:32:35.2012940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.2013448Z 
2025-05-07T20:32:35.2013561Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.2014022Z     self=<...>,
2025-05-07T20:32:35.2014421Z     T=2048,
2025-05-07T20:32:35.2014624Z     D=5120,
2025-05-07T20:32:35.2014825Z     scale_ub=1200.0,
2025-05-07T20:32:35.2015044Z     contiguous=True,
2025-05-07T20:32:35.2015272Z     compiled=False,
2025-05-07T20:32:35.2015478Z )
2025-05-07T20:32:36.1327739Z self = <...>
2025-05-07T20:32:36.1328831Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:36.1329211Z 
2025-05-07T20:32:36.1335417Z     @given(
2025-05-07T20:32:36.1335796Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.1336236Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.1336584Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.1336920Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.1337250Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.1337533Z     )
2025-05-07T20:32:36.1337884Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.1338576Z     def test_silu_mul_quant(
2025-05-07T20:32:36.1338826Z         self,
2025-05-07T20:32:36.1339024Z         T: int,
2025-05-07T20:32:36.1339227Z         D: int,
2025-05-07T20:32:36.1339520Z         scale_ub: Optional[float],
2025-05-07T20:32:36.1339794Z         contiguous: bool,
2025-05-07T20:32:36.1340037Z         compiled: bool,
2025-05-07T20:32:36.1340271Z     ) -> None:
2025-05-07T20:32:36.1340490Z         torch.manual_seed(2025)
2025-05-07T20:32:36.1340735Z 
2025-05-07T20:32:36.1341015Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.1341353Z 
2025-05-07T20:32:36.1341553Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.1341849Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.1342153Z         x = x_sign * x_clamp
2025-05-07T20:32:36.1342397Z         x0 = x[:, :D]
2025-05-07T20:32:36.1342618Z         x1 = x[:, D:]
2025-05-07T20:32:36.1342820Z 
2025-05-07T20:32:36.1343016Z         if contiguous:
2025-05-07T20:32:36.1343262Z             x0 = x0.contiguous()
2025-05-07T20:32:36.1343553Z             x1 = x1.contiguous()
2025-05-07T20:32:36.1343806Z 
2025-05-07T20:32:36.1344004Z         if scale_ub is not None:
2025-05-07T20:32:36.1344278Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.1344615Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.1344928Z             )
2025-05-07T20:32:36.1345126Z         else:
2025-05-07T20:32:36.1345335Z             scale_ub_tensor = None
2025-05-07T20:32:36.1345588Z 
2025-05-07T20:32:36.1345823Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1346128Z             op = silu_mul_quant
2025-05-07T20:32:36.1346384Z             if compiled:
2025-05-07T20:32:36.1346633Z                 op = torch.compile(op)
2025-05-07T20:32:36.1346929Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1347208Z 
2025-05-07T20:32:36.1347406Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:36.1347572Z 
2025-05-07T20:32:36.1347712Z moe/activation_test.py:117: 
2025-05-07T20:32:36.1348006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1348333Z moe/activation_test.py:115: in fn
2025-05-07T20:32:36.1348623Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1349408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:36.1350102Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:36.1350636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.1351324Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.1351985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.1352513Z     kernel = self.compile(
2025-05-07T20:32:36.1353064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.1353757Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.1354257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1354487Z 
2025-05-07T20:32:36.1354694Z self = <...>
2025-05-07T20:32:36.1355776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.1357149Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f983184c180>}
2025-05-07T20:32:36.1358484Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:36.1359616Z context = <...>
2025-05-07T20:32:36.1359903Z 
2025-05-07T20:32:36.1360074Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.1360596Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.1361068Z             module_map=module_map)
2025-05-07T20:32:36.1361431Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1361793Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:36.1362059Z E       ^
2025-05-07T20:32:36.1362522Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1362976Z 
2025-05-07T20:32:36.1363396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:36.1363917Z 
2025-05-07T20:32:36.1364022Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:36.1364439Z     self=<...>,
2025-05-07T20:32:36.1364840Z     T=2048,
2025-05-07T20:32:36.1365038Z     D=5120,
2025-05-07T20:32:36.1365238Z     scale_ub=1200.0,
2025-05-07T20:32:36.1365462Z     contiguous=True,
2025-05-07T20:32:36.1365689Z     compiled=True,
2025-05-07T20:32:36.1365905Z )
2025-05-07T20:32:36.1366229Z self = <...>
2025-05-07T20:32:36.1366711Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:36.1366986Z 
2025-05-07T20:32:36.1367065Z     @given(
2025-05-07T20:32:36.1367308Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:36.1367615Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:36.1367925Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:36.1368257Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:36.1368579Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:36.1368866Z     )
2025-05-07T20:32:36.1369222Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:36.1369666Z     def test_silu_mul_quant(
2025-05-07T20:32:36.1369900Z         self,
2025-05-07T20:32:36.1370101Z         T: int,
2025-05-07T20:32:36.1370301Z         D: int,
2025-05-07T20:32:36.1370514Z         scale_ub: Optional[float],
2025-05-07T20:32:36.1370785Z         contiguous: bool,
2025-05-07T20:32:36.1371026Z         compiled: bool,
2025-05-07T20:32:36.1371243Z     ) -> None:
2025-05-07T20:32:36.1371463Z         torch.manual_seed(2025)
2025-05-07T20:32:36.1371708Z 
2025-05-07T20:32:36.1371975Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:36.1372322Z 
2025-05-07T20:32:36.1372521Z         x_sign = torch.sign(x)
2025-05-07T20:32:36.1372808Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:36.1373123Z         x = x_sign * x_clamp
2025-05-07T20:32:36.1373367Z         x0 = x[:, :D]
2025-05-07T20:32:36.1373590Z         x1 = x[:, D:]
2025-05-07T20:32:36.1373901Z 
2025-05-07T20:32:36.1374122Z         if contiguous:
2025-05-07T20:32:36.1374353Z             x0 = x0.contiguous()
2025-05-07T20:32:36.1374615Z             x1 = x1.contiguous()
2025-05-07T20:32:36.1374860Z 
2025-05-07T20:32:36.1375064Z         if scale_ub is not None:
2025-05-07T20:32:36.1375333Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:36.1375673Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:36.1375990Z             )
2025-05-07T20:32:36.1376181Z         else:
2025-05-07T20:32:36.1376400Z             scale_ub_tensor = None
2025-05-07T20:32:36.1376655Z 
2025-05-07T20:32:36.1376885Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1377299Z             op = silu_mul_quant
2025-05-07T20:32:36.1377563Z             if compiled:
2025-05-07T20:32:36.1377805Z                 op = torch.compile(op)
2025-05-07T20:32:36.1378108Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:36.1378421Z 
2025-05-07T20:32:36.1378625Z         y_fp8, y_scale = fn()
2025-05-07T20:32:36.1378911Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:36.1379208Z 
2025-05-07T20:32:36.1379449Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:36.1379777Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:36.1380072Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:36.1380386Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:36.1380743Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:36.1381051Z 
2025-05-07T20:32:36.1381253Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:36.1381451Z 
2025-05-07T20:32:36.1381557Z moe/activation_test.py:126: 
2025-05-07T20:32:36.1381855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1382192Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:36.1382517Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:36.1383294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:36.1384064Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:36.1384629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:36.1385304Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:36.1385983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:36.1386701Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:36.1387459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
2025-05-07T20:32:36.1388207Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:36.1388927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:36.1389641Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:36.1390241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:36.1390754Z     fn()
2025-05-07T20:32:36.1391258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:36.1391839Z     self.fn.run(
2025-05-07T20:32:36.1392307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:36.1392832Z     kernel = self.compile(
2025-05-07T20:32:36.1393373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:36.1394072Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:36.1394473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:36.1394699Z 
2025-05-07T20:32:36.1394904Z self = <...>
2025-05-07T20:32:36.1395776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:36.1397353Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9830943560>}
2025-05-07T20:32:36.1398804Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:32:36.1399830Z context = <...>
2025-05-07T20:32:36.1400121Z 
2025-05-07T20:32:36.1400286Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:36.1400798Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:36.1401265Z             module_map=module_map)
2025-05-07T20:32:36.1401623Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:36.1401984Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:36.1402254Z E       ^
2025-05-07T20:32:36.1402712Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:36.1403170Z 
2025-05-07T20:32:36.1403587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
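Both failing paths above, the fused _fbgemm_silu_mul_quant kernel called from fn() and the _kernel_quantize_fp8_row kernel called from ref_fn() through the autotuner, bottom out in the same Triton check: fp8e4nv (FP8 E4M3) lowers to native instructions only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge A10G, which reports capability (8, 6), so Triton offers only 'fp8e4b15' and 'fp8e5' and every kernel touching the e4m3 dtype fails in make_ir. A minimal sketch of a capability gate such a test could use; supports_fp8e4nv and the test class name are hypothetical, not FBGEMM's actual skip logic:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (e4m3) needs SM 8.9+;
        # the A10G here reports (8, 6), which is why make_ir raises above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "requires an FP8-e4m3-capable GPU (SM 8.9+)")
    class SiluMulQuantTest(unittest.TestCase):  # hypothetical class name
        ...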
2025-05-07T20:32:36.1404211Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:36.9371870Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:36.9412380Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:37.8678198Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:37.8710027Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:37.9178833Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:38.2173619Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fn() fails compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:38.2211022Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
2025-05-07T20:32:38.6687692Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn() fails compiling _kernel_quantize_fp8_row
All nine examples fail in src.make_ir with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").
sanitize_overflow=True) 2025-05-07T20:32:39.0994986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980790a8e0>} 2025-05-07T20:32:39.0996453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.0997476Z context = 2025-05-07T20:32:39.0997770Z 2025-05-07T20:32:39.0997943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.0998464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.0998926Z module_map=module_map) 2025-05-07T20:32:39.0999292Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.0999650Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.0999920Z E ^ 2025-05-07T20:32:39.1000387Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1000847Z 2025-05-07T20:32:39.1001270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1001784Z 2025-05-07T20:32:39.1001897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1002307Z self=, 2025-05-07T20:32:39.1002710Z T=128, 2025-05-07T20:32:39.1002905Z D=5120, 2025-05-07T20:32:39.1003104Z scale_ub=None, 2025-05-07T20:32:39.1003327Z contiguous=True, 2025-05-07T20:32:39.1003558Z compiled=True, 2025-05-07T20:32:39.1003764Z ) 2025-05-07T20:32:39.7590696Z self = 2025-05-07T20:32:39.7591259Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.7591564Z 2025-05-07T20:32:39.7591649Z @given( 2025-05-07T20:32:39.7591900Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.7592232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.7592557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.7592900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.7593239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.7593531Z ) 2025-05-07T20:32:39.7593884Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.7594332Z def test_silu_mul_quant( 2025-05-07T20:32:39.7594581Z self, 2025-05-07T20:32:39.7594777Z T: int, 2025-05-07T20:32:39.7595014Z D: int, 2025-05-07T20:32:39.7595241Z scale_ub: Optional[float], 2025-05-07T20:32:39.7595520Z contiguous: bool, 2025-05-07T20:32:39.7595760Z compiled: bool, 2025-05-07T20:32:39.7596004Z ) -> None: 2025-05-07T20:32:39.7596233Z torch.manual_seed(2025) 2025-05-07T20:32:39.7596475Z 2025-05-07T20:32:39.7596759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.7597110Z 2025-05-07T20:32:39.7597310Z x_sign = torch.sign(x) 2025-05-07T20:32:39.7597902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.7598223Z x = x_sign * x_clamp 2025-05-07T20:32:39.7598469Z x0 = x[:, :D] 2025-05-07T20:32:39.7598706Z x1 = x[:, D:] 2025-05-07T20:32:39.7598930Z 2025-05-07T20:32:39.7599122Z if contiguous: 2025-05-07T20:32:39.7599381Z x0 = x0.contiguous() 2025-05-07T20:32:39.7599680Z x1 = x1.contiguous() 2025-05-07T20:32:39.7599942Z 2025-05-07T20:32:39.7600144Z if scale_ub is not None: 2025-05-07T20:32:39.7600448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.7600834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.7601232Z ) 2025-05-07T20:32:39.7601518Z else: 2025-05-07T20:32:39.7601739Z scale_ub_tensor = None 2025-05-07T20:32:39.7601995Z 2025-05-07T20:32:39.7602238Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 
2025-05-07T20:32:39.7602636Z op = silu_mul_quant 2025-05-07T20:32:39.7602892Z if compiled: 2025-05-07T20:32:39.7603151Z op = torch.compile(op) 2025-05-07T20:32:39.7603462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.7603738Z 2025-05-07T20:32:39.7603942Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.7604237Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.7604536Z 2025-05-07T20:32:39.7604779Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.7605125Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.7605428Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.7605746Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.7606120Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.7606441Z 2025-05-07T20:32:39.7606644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.7606848Z 2025-05-07T20:32:39.7606958Z moe/activation_test.py:126: 2025-05-07T20:32:39.7607266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7607604Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.7607932Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.7608731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.7609490Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.7610031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.7610718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.7611417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.7612140Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.7612883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:39.7613630Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.7614355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.7614992Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.7615585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.7616105Z fn() 2025-05-07T20:32:39.7616615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.7617194Z self.fn.run( 2025-05-07T20:32:39.7617664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.7618250Z kernel = self.compile( 2025-05-07T20:32:39.7618791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.7619438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.7619838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.7620065Z 2025-05-07T20:32:39.7620282Z self = 2025-05-07T20:32:39.7621356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.7622858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9807441b20>} 2025-05-07T20:32:39.7624195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.7625269Z context = 2025-05-07T20:32:39.7625554Z 2025-05-07T20:32:39.7625728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.7626248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.7626710Z module_map=module_map) 2025-05-07T20:32:39.7627080Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.7627442Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.7627711Z E ^ 2025-05-07T20:32:39.7628365Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.7628818Z 2025-05-07T20:32:39.7629305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.7629816Z 2025-05-07T20:32:39.7629929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.7630337Z self=, 2025-05-07T20:32:39.7630743Z T=4096, 2025-05-07T20:32:39.7630938Z D=5120, 2025-05-07T20:32:39.7631131Z scale_ub=None, 2025-05-07T20:32:39.7631359Z contiguous=True, 2025-05-07T20:32:39.7631591Z compiled=True, 2025-05-07T20:32:39.7631799Z ) 2025-05-07T20:32:40.2647889Z self = 2025-05-07T20:32:40.2648481Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.2648775Z 2025-05-07T20:32:40.2648913Z @given( 2025-05-07T20:32:40.2649247Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.2649672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.2650074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.2650490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.2650831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.2651118Z ) 2025-05-07T20:32:40.2651465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.2651911Z def test_silu_mul_quant( 2025-05-07T20:32:40.2652159Z self, 2025-05-07T20:32:40.2652370Z T: int, 2025-05-07T20:32:40.2652567Z D: int, 2025-05-07T20:32:40.2652792Z scale_ub: Optional[float], 2025-05-07T20:32:40.2653094Z contiguous: bool, 2025-05-07T20:32:40.2653347Z compiled: bool, 2025-05-07T20:32:40.2653575Z ) -> None: 2025-05-07T20:32:40.2653800Z torch.manual_seed(2025) 2025-05-07T20:32:40.2659683Z 2025-05-07T20:32:40.2660157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.2660523Z 2025-05-07T20:32:40.2660733Z x_sign = torch.sign(x) 2025-05-07T20:32:40.2661028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.2661351Z x = x_sign * x_clamp 2025-05-07T20:32:40.2661607Z x0 = x[:, :D] 2025-05-07T20:32:40.2661831Z x1 = x[:, D:] 2025-05-07T20:32:40.2662046Z 2025-05-07T20:32:40.2662243Z if contiguous: 2025-05-07T20:32:40.2662478Z x0 = x0.contiguous() 2025-05-07T20:32:40.2662745Z x1 = x1.contiguous() 2025-05-07T20:32:40.2662997Z 2025-05-07T20:32:40.2663200Z if scale_ub is not None: 2025-05-07T20:32:40.2663475Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.2663944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.2664265Z ) 2025-05-07T20:32:40.2664462Z else: 2025-05-07T20:32:40.2664683Z scale_ub_tensor 
= None 2025-05-07T20:32:40.2665004Z 2025-05-07T20:32:40.2665246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2665571Z op = silu_mul_quant 2025-05-07T20:32:40.2665830Z if compiled: 2025-05-07T20:32:40.2666078Z op = torch.compile(op) 2025-05-07T20:32:40.2666380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.2666663Z 2025-05-07T20:32:40.2666861Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.2667150Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.2667445Z 2025-05-07T20:32:40.2667690Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.2668033Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.2668337Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.2668662Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.2669018Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2669397Z 2025-05-07T20:32:40.2669615Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.2669814Z 2025-05-07T20:32:40.2669918Z moe/activation_test.py:126: 2025-05-07T20:32:40.2670222Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2670564Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.2670895Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.2671681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.2672439Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.2672990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.2673674Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.2674366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.2675096Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2675851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.2676594Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.2677330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.2677982Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.2678590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.2679111Z fn() 2025-05-07T20:32:40.2679626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.2680221Z self.fn.run( 2025-05-07T20:32:40.2680738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.2681276Z kernel = self.compile( 2025-05-07T20:32:40.2681822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.2682489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.2682889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.2683130Z 2025-05-07T20:32:40.2683339Z self = 2025-05-07T20:32:40.2684430Z options = 
CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:40.2685921Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98075b2700>}
2025-05-07T20:32:40.2687256Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:40.2688285Z context = 
2025-05-07T20:32:40.2688582Z 
2025-05-07T20:32:40.2688751Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:40.2689274Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:40.2689744Z module_map=module_map)
2025-05-07T20:32:40.2690127Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:40.2690487Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:40.2690754Z E ^
2025-05-07T20:32:40.2691228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:40.2691683Z 
2025-05-07T20:32:40.2692099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:40.2692610Z 
2025-05-07T20:32:40.2692724Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:40.2693133Z self=,
2025-05-07T20:32:40.2693542Z T=16384,
2025-05-07T20:32:40.2693749Z D=5120,
2025-05-07T20:32:40.2693957Z scale_ub=None,
2025-05-07T20:32:40.2694177Z contiguous=True,
2025-05-07T20:32:40.2694414Z compiled=True,
2025-05-07T20:32:40.2694634Z )
2025-05-07T20:32:40.2947124Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:40.2948375Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:40.2949743Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:40.2950727Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:40.2951840Z W0507 20:32:40.293000 239371 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
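The torch._dynamo warning above is separate from the fp8 compilation failures. Hypothesis drives test_silu_mul_quant through different T and contiguous values; torch.compile specializes silu_mul_quant on the input shapes and strides (the last guard failure is the stride change from 5120 to 10240 produced by the non-contiguous slices), and after 8 variants Dynamo hits config.recompile_limit and falls back to eager for that frame. A minimal sketch of one way a test like this could reduce the recompile churn, assuming only the silu_mul_quant import path and call signature shown in the tracebacks (the mark_dynamic calls are an illustration, not part of the test file):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # Mark dim 0 (T) as dynamic so each new T reuses one compiled graph
    # instead of counting toward torch._dynamo.config.recompile_limit.
    # Contiguous vs. strided inputs still compile as separate variants.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)

    compiled_op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = compiled_op(x0, x1, None)  # scale_ub_tensor=None, as in the test

Calling torch._dynamo.reset() between examples, or raising torch._dynamo.config.recompile_limit, would also silence the warning, at the cost of extra compiles.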
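Every CompilationError in this run, above and in the examples that continue below, has the same root cause: Triton lowers the fp8e4nv type (float8_e4m3fn) only on NVIDIA GPUs of compute capability 8.9 or newer, and the GPU in this job reports support only for ('fp8e4b15', 'fp8e5'), i.e. it is below that threshold. Both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant therefore fail at kernel-compile time, before any test assertion runs. A minimal sketch of a capability guard that would skip these examples on such GPUs (the helper and the class name are illustrative, not taken from moe/activation_test.py):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv requires compute capability >= 8.9 (Ada/Hopper);
        # older devices only expose 'fp8e4b15' and 'fp8e5'.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "fp8e4nv needs SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):  # hypothetical name
        ...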
2025-05-07T20:32:40.3629131Z self = 2025-05-07T20:32:40.3629672Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.3629949Z 2025-05-07T20:32:40.3630149Z @given( 2025-05-07T20:32:40.3630386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3630708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3631018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3631352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3631684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3631970Z ) 2025-05-07T20:32:40.3632327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3632769Z def test_silu_mul_quant( 2025-05-07T20:32:40.3633023Z self, 2025-05-07T20:32:40.3633233Z T: int, 2025-05-07T20:32:40.3633499Z D: int, 2025-05-07T20:32:40.3633829Z scale_ub: Optional[float], 2025-05-07T20:32:40.3634112Z contiguous: bool, 2025-05-07T20:32:40.3634353Z compiled: bool, 2025-05-07T20:32:40.3634584Z ) -> None: 2025-05-07T20:32:40.3634862Z torch.manual_seed(2025) 2025-05-07T20:32:40.3635112Z 2025-05-07T20:32:40.3635399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3635752Z 2025-05-07T20:32:40.3635960Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3636254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3636568Z x = x_sign * x_clamp 2025-05-07T20:32:40.3636816Z x0 = x[:, :D] 2025-05-07T20:32:40.3637039Z x1 = x[:, D:] 2025-05-07T20:32:40.3637255Z 2025-05-07T20:32:40.3637448Z if contiguous: 2025-05-07T20:32:40.3637682Z x0 = x0.contiguous() 2025-05-07T20:32:40.3637942Z x1 = x1.contiguous() 2025-05-07T20:32:40.3638188Z 2025-05-07T20:32:40.3638390Z if scale_ub is not None: 2025-05-07T20:32:40.3638665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3639004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3639326Z ) 2025-05-07T20:32:40.3639525Z else: 2025-05-07T20:32:40.3639737Z scale_ub_tensor = None 2025-05-07T20:32:40.3639994Z 2025-05-07T20:32:40.3640228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3640549Z op = silu_mul_quant 2025-05-07T20:32:40.3640802Z if compiled: 2025-05-07T20:32:40.3641055Z op = torch.compile(op) 2025-05-07T20:32:40.3641353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3641625Z 2025-05-07T20:32:40.3641821Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.3642112Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.3642399Z 2025-05-07T20:32:40.3642640Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3642979Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.3643267Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.3643581Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.3643942Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.3644256Z 2025-05-07T20:32:40.3644455Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.3644654Z 2025-05-07T20:32:40.3644755Z moe/activation_test.py:126: 2025-05-07T20:32:40.3645060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3645394Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.3645722Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.3646512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.3647260Z 
_kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.3647806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3648488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3649236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.3649952Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3650704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.3651450Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.3652177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.3652851Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.3653487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.3654005Z fn() 2025-05-07T20:32:40.3654581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.3655160Z self.fn.run( 2025-05-07T20:32:40.3655631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3656165Z kernel = self.compile( 2025-05-07T20:32:40.3656704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3657368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3657772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3658006Z 2025-05-07T20:32:40.3658226Z self = 2025-05-07T20:32:40.3659307Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3660673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806edd3a0>} 2025-05-07T20:32:40.3662021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3663054Z context = 2025-05-07T20:32:40.3663342Z 2025-05-07T20:32:40.3663509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3664033Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3664512Z module_map=module_map) 2025-05-07T20:32:40.3664888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3665248Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.3665521Z E ^ 2025-05-07T20:32:40.3665994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3666442Z 2025-05-07T20:32:40.3666865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3667376Z 2025-05-07T20:32:40.3667484Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3667900Z self=, 2025-05-07T20:32:40.3668300Z T=1, 2025-05-07T20:32:40.3668488Z D=5120, 2025-05-07T20:32:40.3668695Z scale_ub=1200.0, 2025-05-07T20:32:40.3668930Z contiguous=True, 2025-05-07T20:32:40.3669208Z compiled=True, 2025-05-07T20:32:40.3669420Z ) 2025-05-07T20:32:40.6304225Z self = 2025-05-07T20:32:40.6305184Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:40.6305485Z 2025-05-07T20:32:40.6305577Z @given( 2025-05-07T20:32:40.6305818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6306135Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6306447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6306785Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6307114Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6307401Z ) 2025-05-07T20:32:40.6307765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6308269Z def test_silu_mul_quant( 2025-05-07T20:32:40.6308577Z self, 2025-05-07T20:32:40.6308778Z T: int, 2025-05-07T20:32:40.6308982Z D: int, 2025-05-07T20:32:40.6309260Z scale_ub: Optional[float], 2025-05-07T20:32:40.6309540Z contiguous: bool, 2025-05-07T20:32:40.6309838Z compiled: bool, 2025-05-07T20:32:40.6310072Z ) -> None: 2025-05-07T20:32:40.6310300Z torch.manual_seed(2025) 2025-05-07T20:32:40.6310537Z 2025-05-07T20:32:40.6310815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6311161Z 2025-05-07T20:32:40.6311361Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6311651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6311963Z x = x_sign * x_clamp 2025-05-07T20:32:40.6312207Z x0 = x[:, :D] 2025-05-07T20:32:40.6312426Z x1 = x[:, D:] 2025-05-07T20:32:40.6312642Z 2025-05-07T20:32:40.6312838Z if contiguous: 2025-05-07T20:32:40.6313073Z x0 = x0.contiguous() 2025-05-07T20:32:40.6313343Z x1 = x1.contiguous() 2025-05-07T20:32:40.6313588Z 2025-05-07T20:32:40.6313777Z if scale_ub is not None: 2025-05-07T20:32:40.6314056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6314399Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6314704Z ) 2025-05-07T20:32:40.6314905Z else: 2025-05-07T20:32:40.6315126Z scale_ub_tensor = None 2025-05-07T20:32:40.6315376Z 2025-05-07T20:32:40.6315648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6315983Z op = silu_mul_quant 2025-05-07T20:32:40.6316237Z if compiled: 2025-05-07T20:32:40.6316484Z op = torch.compile(op) 2025-05-07T20:32:40.6316786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6317067Z 2025-05-07T20:32:40.6317257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6317427Z 2025-05-07T20:32:40.6317531Z moe/activation_test.py:117: 2025-05-07T20:32:40.6317835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6318165Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6318452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6319017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.6319575Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.6320230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6320922Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6321460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6322135Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6322803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6323339Z kernel = self.compile( 2025-05-07T20:32:40.6323887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6324592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6324996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6325225Z 2025-05-07T20:32:40.6325442Z self = 2025-05-07T20:32:40.6326528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6327886Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806db0900>} 2025-05-07T20:32:40.6329533Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6330563Z context = 2025-05-07T20:32:40.6330853Z 2025-05-07T20:32:40.6331032Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6331543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6332009Z module_map=module_map) 2025-05-07T20:32:40.6332373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6332731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6332987Z E ^ 2025-05-07T20:32:40.6333452Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6333903Z 2025-05-07T20:32:40.6334321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6334830Z 2025-05-07T20:32:40.6334953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6335411Z self=, 2025-05-07T20:32:40.6335815Z T=1, 2025-05-07T20:32:40.6336002Z D=5120, 2025-05-07T20:32:40.6336192Z scale_ub=None, 2025-05-07T20:32:40.6336418Z contiguous=False, 2025-05-07T20:32:40.6336651Z compiled=True, 2025-05-07T20:32:40.6336852Z ) 2025-05-07T20:32:40.6813694Z self = 2025-05-07T20:32:40.6814312Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.6814658Z 2025-05-07T20:32:40.6814745Z @given( 2025-05-07T20:32:40.6815022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6815361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6815828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6816470Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817692Z ) 2025-05-07T20:32:40.6818377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6819263Z def test_silu_mul_quant( 2025-05-07T20:32:40.6819747Z self, 2025-05-07T20:32:40.6820131Z T: int, 2025-05-07T20:32:40.6820528Z D: int, 2025-05-07T20:32:40.6820966Z scale_ub: Optional[float], 2025-05-07T20:32:40.6821493Z contiguous: bool, 2025-05-07T20:32:40.6821975Z compiled: bool, 2025-05-07T20:32:40.6822425Z ) -> None: 2025-05-07T20:32:40.6822852Z torch.manual_seed(2025) 2025-05-07T20:32:40.6823325Z 2025-05-07T20:32:40.6823867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6824545Z 2025-05-07T20:32:40.6824924Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6825386Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6825702Z x = x_sign * x_clamp 2025-05-07T20:32:40.6826282Z x0 = x[:, :D] 2025-05-07T20:32:40.6826510Z x1 = x[:, D:] 2025-05-07T20:32:40.6826728Z 2025-05-07T20:32:40.6826917Z if contiguous: 2025-05-07T20:32:40.6827155Z x0 = x0.contiguous() 2025-05-07T20:32:40.6827414Z x1 = x1.contiguous() 2025-05-07T20:32:40.6827650Z 2025-05-07T20:32:40.6827851Z if scale_ub is not None: 2025-05-07T20:32:40.6828398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6828735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6829090Z ) 2025-05-07T20:32:40.6829288Z else: 2025-05-07T20:32:40.6829500Z scale_ub_tensor = None 2025-05-07T20:32:40.6829937Z 2025-05-07T20:32:40.6830179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6830487Z op = silu_mul_quant 2025-05-07T20:32:40.6830743Z if compiled: 2025-05-07T20:32:40.6831076Z op = torch.compile(op) 2025-05-07T20:32:40.6831384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6831677Z 2025-05-07T20:32:40.6831882Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.6832174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.6832465Z 2025-05-07T20:32:40.6832708Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6833048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.6833351Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.6833667Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.6834033Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6834352Z 2025-05-07T20:32:40.6834560Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:40.6834764Z 2025-05-07T20:32:40.6834872Z moe/activation_test.py:126: 2025-05-07T20:32:40.6835184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6835521Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6835859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6836659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6837415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6837962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6838660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6839351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6840088Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6840843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:40.6841599Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6842331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6842969Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6843573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6844099Z fn() 2025-05-07T20:32:40.6844614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6845197Z self.fn.run( 2025-05-07T20:32:40.6845675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6846212Z kernel = self.compile( 2025-05-07T20:32:40.6846818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6847477Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6847878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6848108Z 2025-05-07T20:32:40.6848325Z self = 2025-05-07T20:32:40.6849401Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6850843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f9806dce0c0>} 2025-05-07T20:32:40.6852273Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6853300Z context = 2025-05-07T20:32:40.6853591Z 2025-05-07T20:32:40.6853771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6854282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6854763Z module_map=module_map) 2025-05-07T20:32:40.6855140Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6855500Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6855784Z E ^ 2025-05-07T20:32:40.6856270Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6856719Z 2025-05-07T20:32:40.6857153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6857670Z 2025-05-07T20:32:40.6857783Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6858213Z self=, 2025-05-07T20:32:40.6858630Z T=1, 2025-05-07T20:32:40.6858822Z D=5120, 2025-05-07T20:32:40.6859031Z scale_ub=None, 2025-05-07T20:32:40.6859266Z contiguous=True, 2025-05-07T20:32:40.6859498Z compiled=False, 2025-05-07T20:32:40.6859722Z ) 2025-05-07T20:32:40.8010095Z self = 2025-05-07T20:32:40.8010616Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.8010918Z 2025-05-07T20:32:40.8011044Z @given( 2025-05-07T20:32:40.8011388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8011839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8012285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8012660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8013004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8013310Z ) 2025-05-07T20:32:40.8013661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8014121Z def test_silu_mul_quant( 2025-05-07T20:32:40.8014385Z self, 2025-05-07T20:32:40.8014594Z T: int, 2025-05-07T20:32:40.8014813Z D: int, 2025-05-07T20:32:40.8015049Z scale_ub: Optional[float], 2025-05-07T20:32:40.8015329Z contiguous: bool, 2025-05-07T20:32:40.8015587Z compiled: bool, 2025-05-07T20:32:40.8015855Z ) -> None: 2025-05-07T20:32:40.8016106Z torch.manual_seed(2025) 2025-05-07T20:32:40.8016366Z 2025-05-07T20:32:40.8016654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8017039Z 2025-05-07T20:32:40.8017252Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8017736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8018071Z x = x_sign * x_clamp 2025-05-07T20:32:40.8018330Z x0 = x[:, :D] 2025-05-07T20:32:40.8018558Z x1 = x[:, D:] 2025-05-07T20:32:40.8018781Z 2025-05-07T20:32:40.8018986Z if contiguous: 2025-05-07T20:32:40.8019234Z x0 = x0.contiguous() 2025-05-07T20:32:40.8019501Z x1 = x1.contiguous() 2025-05-07T20:32:40.8019754Z 2025-05-07T20:32:40.8019962Z if scale_ub is not None: 2025-05-07T20:32:40.8020237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8020586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8020993Z ) 2025-05-07T20:32:40.8021276Z else: 2025-05-07T20:32:40.8021503Z scale_ub_tensor = None 2025-05-07T20:32:40.8021763Z 2025-05-07T20:32:40.8022000Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8022407Z op = silu_mul_quant 2025-05-07T20:32:40.8022673Z if compiled: 2025-05-07T20:32:40.8022922Z 
op = torch.compile(op) 2025-05-07T20:32:40.8023231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8023520Z 2025-05-07T20:32:40.8023720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8023896Z 2025-05-07T20:32:40.8024002Z moe/activation_test.py:117: 2025-05-07T20:32:40.8024309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8024653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8024939Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8025694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8026400Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8026937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8027636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8028569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8029162Z kernel = self.compile( 2025-05-07T20:32:40.8029702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8030364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8030776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8031006Z 2025-05-07T20:32:40.8031227Z self = 2025-05-07T20:32:40.8032316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8033702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf420>} 2025-05-07T20:32:40.8035049Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8036129Z context = 2025-05-07T20:32:40.8036417Z 2025-05-07T20:32:40.8036586Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8037116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8037600Z module_map=module_map) 2025-05-07T20:32:40.8037978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8038334Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8038681Z E ^ 2025-05-07T20:32:40.8039162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8039616Z 2025-05-07T20:32:40.8040040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8040567Z 2025-05-07T20:32:40.8040676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8041103Z self=, 2025-05-07T20:32:40.8041517Z T=128, 2025-05-07T20:32:40.8041711Z D=5120, 2025-05-07T20:32:40.8041924Z scale_ub=None, 2025-05-07T20:32:40.8042220Z contiguous=False, 2025-05-07T20:32:40.8042513Z compiled=True, 2025-05-07T20:32:40.8042740Z ) 2025-05-07T20:32:40.8043077Z self = 2025-05-07T20:32:40.8043636Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8043916Z 2025-05-07T20:32:40.8044001Z @given( 2025-05-07T20:32:40.8044249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8044579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8044893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8045241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8045583Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8045918Z ) 2025-05-07T20:32:40.8046282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8046731Z def test_silu_mul_quant( 2025-05-07T20:32:40.8046980Z self, 2025-05-07T20:32:40.8047193Z T: int, 2025-05-07T20:32:40.8047405Z D: int, 2025-05-07T20:32:40.8047629Z scale_ub: Optional[float], 2025-05-07T20:32:40.8047909Z contiguous: bool, 2025-05-07T20:32:40.8048167Z compiled: bool, 2025-05-07T20:32:40.8048392Z ) -> None: 2025-05-07T20:32:40.8048622Z torch.manual_seed(2025) 2025-05-07T20:32:40.8048879Z 2025-05-07T20:32:40.8049165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8049508Z 2025-05-07T20:32:40.8049716Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8050022Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8050332Z x = x_sign * x_clamp 2025-05-07T20:32:40.8050585Z x0 = x[:, :D] 2025-05-07T20:32:40.8050816Z x1 = x[:, D:] 2025-05-07T20:32:40.8051026Z 2025-05-07T20:32:40.8051226Z if contiguous: 2025-05-07T20:32:40.8051470Z x0 = x0.contiguous() 2025-05-07T20:32:40.8051733Z x1 = x1.contiguous() 2025-05-07T20:32:40.8051990Z 2025-05-07T20:32:40.8052198Z if scale_ub is not None: 2025-05-07T20:32:40.8052476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8052827Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8053153Z ) 2025-05-07T20:32:40.8053363Z else: 2025-05-07T20:32:40.8053575Z scale_ub_tensor = None 2025-05-07T20:32:40.8053833Z 2025-05-07T20:32:40.8054076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8054388Z op = silu_mul_quant 2025-05-07T20:32:40.8054648Z if compiled: 2025-05-07T20:32:40.8054905Z op = torch.compile(op) 2025-05-07T20:32:40.8055203Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8055491Z 2025-05-07T20:32:40.8055699Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8055863Z 2025-05-07T20:32:40.8055968Z moe/activation_test.py:117: 2025-05-07T20:32:40.8056321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8056666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8056957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8057564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8058128Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8058786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8059465Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8060004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8060687Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8061352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8061969Z kernel = self.compile( 2025-05-07T20:32:40.8062512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8063225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8063628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8063866Z 2025-05-07T20:32:40.8064074Z self = 2025-05-07T20:32:40.8065152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8066520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf1a0>} 2025-05-07T20:32:40.8067874Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8068896Z context = 2025-05-07T20:32:40.8069278Z 2025-05-07T20:32:40.8069448Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8069970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8070440Z module_map=module_map) 2025-05-07T20:32:40.8070803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8071162Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8071435Z E ^ 2025-05-07T20:32:40.8071897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8072361Z 2025-05-07T20:32:40.8072778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8073297Z 2025-05-07T20:32:40.8073406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8073828Z self=, 2025-05-07T20:32:40.8074230Z T=128, 2025-05-07T20:32:40.8074428Z D=7168, 2025-05-07T20:32:40.8074633Z scale_ub=1200.0, 2025-05-07T20:32:40.8074863Z contiguous=False, 2025-05-07T20:32:40.8075109Z compiled=False, 2025-05-07T20:32:40.8075362Z ) 2025-05-07T20:32:40.8950252Z self = 2025-05-07T20:32:40.8951643Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.8952184Z 2025-05-07T20:32:40.8952344Z @given( 2025-05-07T20:32:40.8952812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8953466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8954078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8954726Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8955573Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8955870Z ) 2025-05-07T20:32:40.8956219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8956668Z def test_silu_mul_quant( 2025-05-07T20:32:40.8956919Z self, 2025-05-07T20:32:40.8957117Z T: int, 2025-05-07T20:32:40.8957322Z D: int, 2025-05-07T20:32:40.8957551Z scale_ub: Optional[float], 2025-05-07T20:32:40.8957819Z contiguous: bool, 2025-05-07T20:32:40.8958066Z compiled: bool, 2025-05-07T20:32:40.8958299Z ) -> None: 2025-05-07T20:32:40.8958517Z torch.manual_seed(2025) 2025-05-07T20:32:40.8958765Z 2025-05-07T20:32:40.8959046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8959608Z 2025-05-07T20:32:40.8959807Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8960112Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8960504Z x = x_sign * x_clamp 2025-05-07T20:32:40.8960751Z x0 = x[:, :D] 2025-05-07T20:32:40.8960980Z x1 = x[:, D:] 2025-05-07T20:32:40.8961197Z 2025-05-07T20:32:40.8961395Z if contiguous: 2025-05-07T20:32:40.8961629Z x0 = x0.contiguous() 2025-05-07T20:32:40.8961897Z x1 = x1.contiguous() 2025-05-07T20:32:40.8962143Z 2025-05-07T20:32:40.8962336Z if scale_ub is not None: 2025-05-07T20:32:40.8962617Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8962959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8963266Z ) 2025-05-07T20:32:40.8963471Z else: 2025-05-07T20:32:40.8963693Z scale_ub_tensor = None 2025-05-07T20:32:40.8963944Z 2025-05-07T20:32:40.8964190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8964510Z op = silu_mul_quant 2025-05-07T20:32:40.8964761Z if compiled: 2025-05-07T20:32:40.8965021Z op = torch.compile(op) 2025-05-07T20:32:40.8965364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8965647Z 2025-05-07T20:32:40.8965850Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8966022Z 2025-05-07T20:32:40.8966125Z moe/activation_test.py:117: 2025-05-07T20:32:40.8966431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8966762Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8967052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8967742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8968429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8968969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8969663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8970330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8970860Z kernel = self.compile( 2025-05-07T20:32:40.8971407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8972066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8972459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8972696Z 2025-05-07T20:32:40.8972903Z self = 2025-05-07T20:32:40.8973985Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8975428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98065307c0>} 2025-05-07T20:32:40.8976816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8977832Z context = 2025-05-07T20:32:40.8978128Z 2025-05-07T20:32:40.8978296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8978821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8979335Z module_map=module_map) 2025-05-07T20:32:40.8979737Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8980100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8980369Z E ^ 2025-05-07T20:32:40.8980873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8981334Z 2025-05-07T20:32:40.8981750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8982268Z 2025-05-07T20:32:40.8982378Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8982802Z self=, 2025-05-07T20:32:40.8983202Z T=128, 2025-05-07T20:32:40.8983402Z D=5120, 2025-05-07T20:32:40.8983608Z scale_ub=None, 2025-05-07T20:32:40.8983828Z contiguous=False, 2025-05-07T20:32:40.8984063Z compiled=False, 2025-05-07T20:32:40.8984283Z ) 2025-05-07T20:32:40.8984602Z self = 2025-05-07T20:32:40.8985102Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.8985379Z 2025-05-07T20:32:40.8985466Z @given( 2025-05-07T20:32:40.8985710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8986023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8986339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8986676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8987001Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8987288Z ) 2025-05-07T20:32:40.8987644Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8988082Z def test_silu_mul_quant( 2025-05-07T20:32:40.8988329Z self, 2025-05-07T20:32:40.8988536Z T: int, 2025-05-07T20:32:40.8988734Z D: int, 2025-05-07T20:32:40.8988963Z scale_ub: Optional[float], 2025-05-07T20:32:40.8989345Z contiguous: bool, 2025-05-07T20:32:40.8989592Z compiled: bool, 2025-05-07T20:32:40.8989814Z ) -> None: 2025-05-07T20:32:40.8990039Z torch.manual_seed(2025) 2025-05-07T20:32:40.8990290Z 2025-05-07T20:32:40.8990563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8990916Z 2025-05-07T20:32:40.8991121Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8991410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8991727Z x = x_sign * x_clamp 2025-05-07T20:32:40.8991980Z x0 = x[:, :D] 2025-05-07T20:32:40.8992199Z x1 = x[:, D:] 2025-05-07T20:32:40.8992414Z 2025-05-07T20:32:40.8992605Z if contiguous: 2025-05-07T20:32:40.8992834Z x0 = x0.contiguous() 2025-05-07T20:32:40.8993094Z x1 = x1.contiguous() 2025-05-07T20:32:40.8993336Z 2025-05-07T20:32:40.8993536Z if scale_ub is not None: 2025-05-07T20:32:40.8993816Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8994156Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8994463Z ) 2025-05-07T20:32:40.8994663Z else: 2025-05-07T20:32:40.8994936Z scale_ub_tensor = None 2025-05-07T20:32:40.8995217Z 2025-05-07T20:32:40.8995476Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8995798Z op = silu_mul_quant 2025-05-07T20:32:40.8996064Z if compiled: 2025-05-07T20:32:40.8996311Z op = torch.compile(op) 2025-05-07T20:32:40.8996617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8996896Z 2025-05-07T20:32:40.8997096Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8997267Z 2025-05-07T20:32:40.8997367Z moe/activation_test.py:117: 2025-05-07T20:32:40.8997670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8998046Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8998369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8999101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8999803Z 
_fbgemm_silu_mul_quant[grid](
    [... identical Triton JIT/compile traceback as above; omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

The same CompilationError is raised for each of the following Hypothesis examples; the duplicate test-body printouts and tracebacks are omitted:

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
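Root cause of these failures: Triton's fp8e4nv is the e4m3 FP8 variant (the type PyTorch exposes as torch.float8_e4m3fn), and Triton's NVIDIA backend can only generate it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures only fp8e4b15 and fp8e5 are available, which is exactly the list the ValueError prints. Below is a minimal sketch of a skip-guard for such hardware; the helper name, the decorator, and the (8, 9) threshold are assumptions inferred from the error message, not part of the test file.

    # Hypothetical skip-guard, not part of moe/activation_test.py; assumes
    # fp8e4nv codegen requires NVIDIA compute capability >= (8, 9).
    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor) of the current CUDA device, e.g. (8, 6) for an A10G.
        return torch.cuda.get_device_capability() >= (8, 9)


    requires_fp8e4nv = unittest.skipIf(
        not _supports_fp8e4nv(),
        "fp8e4nv requires SM 8.9+ (Ada/Hopper); this GPU only supports fp8e4b15/fp8e5",
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would report the examples above as skips instead of errors on pre-Ada runners.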
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[duplicate test-body printout omitted; for this example fn() itself returns and the failure moves to the reference path]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [... autotuner benchmarking and Triton JIT/compile frames omitted ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
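For clarity, the quantization ref_fn asks for can be written in plain PyTorch. The sketch below is an illustrative stand-in for triton_quantize_fp8_row, assuming per-row max-abs scaling against the e4m3fn maximum with an optional scale_ub clamp (inferred from the call site and from the y_fp8.to(torch.float32) * y_scale[:, None] dequantization above); it is not FBGEMM's actual implementation, which JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) and therefore hits the same architecture check even though the op under test was never reached.

    # Illustrative pure-PyTorch row-wise FP8 quantization; an assumed stand-in
    # for triton_quantize_fp8_row, not FBGEMM's implementation.
    from typing import Optional, Tuple

    import torch

    E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs scale, optionally clamped to an upper bound.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize with y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale

A fallback like this keeps the reference math testable on hardware where Triton's FP8 codegen is unavailable, at the cost of exact parity with the Triton kernel's rounding.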
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)

The same CompilationError is raised for this and each of the following Hypothesis examples; the duplicate test-body printouts and tracebacks are omitted:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False), which fails with the same
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.8329031Z 2025-05-07T20:32:41.8329453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.8329969Z 2025-05-07T20:32:41.8330081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.8330497Z self=, 2025-05-07T20:32:41.8330909Z T=16384, 2025-05-07T20:32:41.8331111Z D=7168, 2025-05-07T20:32:41.8331301Z scale_ub=None, 2025-05-07T20:32:41.8331520Z contiguous=True, 2025-05-07T20:32:41.8331746Z compiled=True, 2025-05-07T20:32:41.8331943Z ) 2025-05-07T20:32:41.9691662Z self = 2025-05-07T20:32:41.9692230Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:41.9692513Z 2025-05-07T20:32:41.9692594Z @given( 2025-05-07T20:32:41.9692825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9693134Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9693441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9693777Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9694107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9694392Z ) 2025-05-07T20:32:41.9694742Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9695181Z def test_silu_mul_quant( 2025-05-07T20:32:41.9695421Z self, 2025-05-07T20:32:41.9695647Z T: int, 2025-05-07T20:32:41.9695864Z D: int, 2025-05-07T20:32:41.9696089Z scale_ub: Optional[float], 2025-05-07T20:32:41.9696524Z contiguous: bool, 2025-05-07T20:32:41.9696768Z compiled: bool, 2025-05-07T20:32:41.9696985Z ) -> None: 2025-05-07T20:32:41.9697204Z torch.manual_seed(2025) 2025-05-07T20:32:41.9697444Z 2025-05-07T20:32:41.9697710Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9698048Z 2025-05-07T20:32:41.9698239Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9698525Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9698833Z x = x_sign * x_clamp 2025-05-07T20:32:41.9699076Z x0 = x[:, :D] 2025-05-07T20:32:41.9699291Z x1 = x[:, D:] 2025-05-07T20:32:41.9699505Z 2025-05-07T20:32:41.9699690Z if contiguous: 2025-05-07T20:32:41.9700035Z x0 = x0.contiguous() 2025-05-07T20:32:41.9700291Z x1 = x1.contiguous() 2025-05-07T20:32:41.9700531Z 2025-05-07T20:32:41.9700715Z if scale_ub is not None: 2025-05-07T20:32:41.9700997Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9701335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9701641Z ) 2025-05-07T20:32:41.9701828Z else: 2025-05-07T20:32:41.9702039Z scale_ub_tensor = None 2025-05-07T20:32:41.9702294Z 2025-05-07T20:32:41.9702527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9702841Z op = silu_mul_quant 2025-05-07T20:32:41.9703167Z if compiled: 2025-05-07T20:32:41.9703474Z op = torch.compile(op) 2025-05-07T20:32:41.9703778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9704058Z 2025-05-07T20:32:41.9704254Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9704478Z 2025-05-07T20:32:41.9704591Z moe/activation_test.py:117: 2025-05-07T20:32:41.9704889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9705216Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9705506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9706119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9706680Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.9707336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9708019Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9708560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9709289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9709948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9710476Z kernel = self.compile( 2025-05-07T20:32:41.9711022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9711669Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9712078Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9712306Z 2025-05-07T20:32:41.9712516Z self = 2025-05-07T20:32:41.9713585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9714953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1120>} 2025-05-07T20:32:41.9716331Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9717353Z context = 2025-05-07T20:32:41.9717643Z 2025-05-07T20:32:41.9717809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9718328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9718788Z module_map=module_map) 2025-05-07T20:32:41.9719155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9719507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9719769Z E ^ 2025-05-07T20:32:41.9720278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9720728Z 2025-05-07T20:32:41.9721144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9721653Z 2025-05-07T20:32:41.9721761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.9722163Z self=, 2025-05-07T20:32:41.9722560Z T=4096, 2025-05-07T20:32:41.9722750Z D=5120, 2025-05-07T20:32:41.9722944Z scale_ub=None, 2025-05-07T20:32:41.9723157Z contiguous=False, 2025-05-07T20:32:41.9723381Z compiled=True, 2025-05-07T20:32:41.9723629Z ) 2025-05-07T20:32:41.9723945Z self = 2025-05-07T20:32:41.9724476Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.9724745Z 2025-05-07T20:32:41.9724830Z @given( 2025-05-07T20:32:41.9725099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.9725415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.9725721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.9726042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.9726365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.9726651Z ) 2025-05-07T20:32:41.9726996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.9727425Z def test_silu_mul_quant( 2025-05-07T20:32:41.9727691Z self, 2025-05-07T20:32:41.9727886Z T: int, 2025-05-07T20:32:41.9728084Z D: int, 2025-05-07T20:32:41.9728459Z scale_ub: Optional[float], 2025-05-07T20:32:41.9728731Z contiguous: bool, 2025-05-07T20:32:41.9728969Z compiled: bool, 2025-05-07T20:32:41.9729191Z ) -> None: 2025-05-07T20:32:41.9729413Z torch.manual_seed(2025) 2025-05-07T20:32:41.9729654Z 2025-05-07T20:32:41.9729922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.9730268Z 2025-05-07T20:32:41.9730458Z x_sign = torch.sign(x) 2025-05-07T20:32:41.9730750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.9731058Z x = x_sign * x_clamp 2025-05-07T20:32:41.9731302Z x0 = x[:, :D] 2025-05-07T20:32:41.9731527Z x1 = x[:, D:] 2025-05-07T20:32:41.9731741Z 2025-05-07T20:32:41.9731931Z if contiguous: 2025-05-07T20:32:41.9732165Z x0 = x0.contiguous() 2025-05-07T20:32:41.9732417Z x1 = x1.contiguous() 2025-05-07T20:32:41.9732659Z 2025-05-07T20:32:41.9732865Z if scale_ub is not None: 2025-05-07T20:32:41.9733146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.9733484Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.9733794Z ) 2025-05-07T20:32:41.9733987Z else: 2025-05-07T20:32:41.9734197Z scale_ub_tensor = None 2025-05-07T20:32:41.9734446Z 2025-05-07T20:32:41.9734679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.9734997Z op = silu_mul_quant 2025-05-07T20:32:41.9735249Z if compiled: 2025-05-07T20:32:41.9735493Z op = torch.compile(op) 2025-05-07T20:32:41.9735789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9736071Z 2025-05-07T20:32:41.9736258Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.9736425Z 2025-05-07T20:32:41.9736522Z moe/activation_test.py:117: 2025-05-07T20:32:41.9736819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9737155Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.9737431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.9737993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.9738557Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.9739287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.9739967Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.9740495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.9741171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.9741824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.9742347Z kernel = self.compile( 2025-05-07T20:32:41.9742945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.9743648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.9744091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.9744328Z 2025-05-07T20:32:41.9744532Z self = 2025-05-07T20:32:41.9745641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.9747015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1c60>} 2025-05-07T20:32:41.9748349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.9749409Z context = 2025-05-07T20:32:41.9749695Z 2025-05-07T20:32:41.9749867Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.9750386Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.9750850Z module_map=module_map) 2025-05-07T20:32:41.9751208Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.9751555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.9751814Z E ^ 2025-05-07T20:32:41.9752266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.9752717Z 2025-05-07T20:32:41.9753135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.9753654Z 2025-05-07T20:32:42.0898834Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0899321Z self=, 2025-05-07T20:32:42.0899740Z T=4096, 2025-05-07T20:32:42.0899930Z D=5120, 2025-05-07T20:32:42.0900119Z scale_ub=1200.0, 2025-05-07T20:32:42.0900345Z contiguous=False, 2025-05-07T20:32:42.0900568Z compiled=False, 2025-05-07T20:32:42.0900773Z ) 2025-05-07T20:32:42.0901093Z self = 2025-05-07T20:32:42.0901582Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.0901859Z 2025-05-07T20:32:42.0901938Z @given( 2025-05-07T20:32:42.0902174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0902481Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0902792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0903126Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0903453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0903729Z ) 2025-05-07T20:32:42.0904213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0904656Z def test_silu_mul_quant( 2025-05-07T20:32:42.0904891Z self, 2025-05-07T20:32:42.0905084Z T: int, 2025-05-07T20:32:42.0905283Z D: int, 2025-05-07T20:32:42.0905497Z scale_ub: Optional[float], 2025-05-07T20:32:42.0905773Z contiguous: bool, 2025-05-07T20:32:42.0906011Z compiled: bool, 2025-05-07T20:32:42.0906225Z ) -> None: 2025-05-07T20:32:42.0906441Z torch.manual_seed(2025) 2025-05-07T20:32:42.0906678Z 2025-05-07T20:32:42.0906943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0907285Z 2025-05-07T20:32:42.0907480Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0907888Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0908191Z x = x_sign * x_clamp 2025-05-07T20:32:42.0908433Z x0 = x[:, :D] 2025-05-07T20:32:42.0908657Z x1 = x[:, D:] 2025-05-07T20:32:42.0908913Z 2025-05-07T20:32:42.0909153Z if contiguous: 2025-05-07T20:32:42.0909386Z x0 = x0.contiguous() 2025-05-07T20:32:42.0909643Z x1 = x1.contiguous() 2025-05-07T20:32:42.0909883Z 2025-05-07T20:32:42.0910074Z if scale_ub is not None: 2025-05-07T20:32:42.0910338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0910665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0910972Z ) 2025-05-07T20:32:42.0911160Z else: 2025-05-07T20:32:42.0911368Z scale_ub_tensor = None 2025-05-07T20:32:42.0911616Z 2025-05-07T20:32:42.0911843Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0912162Z op = silu_mul_quant 2025-05-07T20:32:42.0912416Z if compiled: 2025-05-07T20:32:42.0912657Z op = torch.compile(op) 2025-05-07T20:32:42.0912951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0913223Z 2025-05-07T20:32:42.0913417Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0913614Z 2025-05-07T20:32:42.0913723Z moe/activation_test.py:117: 2025-05-07T20:32:42.0914019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0914351Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0914628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0915306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.0915992Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0916525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0917206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0917858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0918388Z kernel = self.compile( 2025-05-07T20:32:42.0918924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0919567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0919961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0920192Z 2025-05-07T20:32:42.0920395Z self = 2025-05-07T20:32:42.0921468Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0923018Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae3240>} 2025-05-07T20:32:42.0924390Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0925403Z context = 2025-05-07T20:32:42.0925689Z 2025-05-07T20:32:42.0925858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0926367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0926828Z module_map=module_map) 2025-05-07T20:32:42.0927190Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0927579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0927872Z E ^ 2025-05-07T20:32:42.0928484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0928930Z 2025-05-07T20:32:42.0929418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0929924Z 2025-05-07T20:32:42.0930033Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0930439Z self=, 2025-05-07T20:32:42.0930837Z T=4096, 2025-05-07T20:32:42.0931023Z D=5120, 2025-05-07T20:32:42.0931213Z scale_ub=1200.0, 2025-05-07T20:32:42.0931436Z contiguous=False, 2025-05-07T20:32:42.0931660Z compiled=True, 2025-05-07T20:32:42.0931858Z ) 2025-05-07T20:32:42.0932175Z self = 2025-05-07T20:32:42.0932664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.0932936Z 2025-05-07T20:32:42.0933020Z @given( 2025-05-07T20:32:42.0933246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0933560Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0933867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0934190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0934513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0934793Z ) 2025-05-07T20:32:42.0935133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0935582Z def test_silu_mul_quant( 2025-05-07T20:32:42.0935862Z self, 2025-05-07T20:32:42.0936052Z T: int, 2025-05-07T20:32:42.0936252Z D: int, 2025-05-07T20:32:42.0936474Z scale_ub: Optional[float], 2025-05-07T20:32:42.0936739Z contiguous: bool, 2025-05-07T20:32:42.0936975Z compiled: bool, 2025-05-07T20:32:42.0937195Z ) -> None: 2025-05-07T20:32:42.0937412Z torch.manual_seed(2025) 2025-05-07T20:32:42.0937645Z 2025-05-07T20:32:42.0937916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0938255Z 2025-05-07T20:32:42.0938446Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0938732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0939035Z x = x_sign * x_clamp 2025-05-07T20:32:42.0939270Z x0 = x[:, :D] 2025-05-07T20:32:42.0939487Z x1 = x[:, D:] 2025-05-07T20:32:42.0939697Z 2025-05-07T20:32:42.0939878Z if contiguous: 2025-05-07T20:32:42.0940108Z x0 = x0.contiguous() 2025-05-07T20:32:42.0940361Z x1 = x1.contiguous() 2025-05-07T20:32:42.0940590Z 2025-05-07T20:32:42.0940780Z if scale_ub is not None: 2025-05-07T20:32:42.0941045Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0941373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0941681Z ) 2025-05-07T20:32:42.0941876Z else: 2025-05-07T20:32:42.0942084Z scale_ub_tensor = None 2025-05-07T20:32:42.0942326Z 2025-05-07T20:32:42.0942554Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0942938Z op = silu_mul_quant 2025-05-07T20:32:42.0943185Z if compiled: 2025-05-07T20:32:42.0943428Z op = torch.compile(op) 2025-05-07T20:32:42.0943722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0943987Z 2025-05-07T20:32:42.0944178Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0944340Z 2025-05-07T20:32:42.0944445Z moe/activation_test.py:117: 2025-05-07T20:32:42.0944732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0945063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0945341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0946007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.0946615Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.0947301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.0947980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0948504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0949240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0949898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0950425Z kernel = self.compile( 2025-05-07T20:32:42.0950955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0951603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0951999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0952224Z 2025-05-07T20:32:42.0952433Z self = 2025-05-07T20:32:42.0953505Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0954855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664c720>} 2025-05-07T20:32:42.0956179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0957195Z context = 2025-05-07T20:32:42.0957476Z 2025-05-07T20:32:42.0957644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0958159Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0958618Z module_map=module_map) 2025-05-07T20:32:42.0958983Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0959329Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0959586Z E ^ 2025-05-07T20:32:42.0960050Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0960495Z 2025-05-07T20:32:42.0960907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0961417Z 2025-05-07T20:32:42.1841800Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1850035Z self=, 2025-05-07T20:32:42.1850449Z T=2048, 2025-05-07T20:32:42.1850637Z D=7168, 2025-05-07T20:32:42.1850838Z scale_ub=1200.0, 2025-05-07T20:32:42.1851178Z contiguous=False, 2025-05-07T20:32:42.1851405Z compiled=False, 2025-05-07T20:32:42.1851609Z ) 2025-05-07T20:32:42.1851925Z self = 2025-05-07T20:32:42.1852412Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.1852693Z 2025-05-07T20:32:42.1852770Z @given( 2025-05-07T20:32:42.1853001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1853310Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1853614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1853948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1854340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1854680Z ) 2025-05-07T20:32:42.1855033Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1855484Z def test_silu_mul_quant( 2025-05-07T20:32:42.1855855Z self, 2025-05-07T20:32:42.1856064Z T: int, 2025-05-07T20:32:42.1856266Z D: int, 2025-05-07T20:32:42.1856484Z scale_ub: Optional[float], 2025-05-07T20:32:42.1856750Z contiguous: bool, 2025-05-07T20:32:42.1856992Z compiled: bool, 2025-05-07T20:32:42.1857211Z ) -> None: 2025-05-07T20:32:42.1857429Z torch.manual_seed(2025) 2025-05-07T20:32:42.1857666Z 2025-05-07T20:32:42.1857936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1858280Z 2025-05-07T20:32:42.1858483Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1858773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1859089Z x = x_sign * x_clamp 2025-05-07T20:32:42.1859335Z x0 = x[:, :D] 2025-05-07T20:32:42.1859551Z x1 = x[:, D:] 2025-05-07T20:32:42.1859760Z 2025-05-07T20:32:42.1859942Z if contiguous: 2025-05-07T20:32:42.1860175Z x0 = x0.contiguous() 2025-05-07T20:32:42.1860431Z x1 = x1.contiguous() 2025-05-07T20:32:42.1860668Z 2025-05-07T20:32:42.1860859Z if scale_ub is not None: 2025-05-07T20:32:42.1861122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1861455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1861758Z ) 2025-05-07T20:32:42.1861944Z else: 2025-05-07T20:32:42.1862159Z scale_ub_tensor = None 2025-05-07T20:32:42.1862407Z 2025-05-07T20:32:42.1862631Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1862948Z op = silu_mul_quant 2025-05-07T20:32:42.1863197Z if compiled: 2025-05-07T20:32:42.1863440Z op = torch.compile(op) 2025-05-07T20:32:42.1863733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1864012Z 2025-05-07T20:32:42.1864198Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1864366Z 2025-05-07T20:32:42.1864468Z moe/activation_test.py:117: 2025-05-07T20:32:42.1864781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1865134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1865422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1866170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:42.1866851Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1867372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1868048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1868717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1869314Z kernel = self.compile( 2025-05-07T20:32:42.1869903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1870553Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1870954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1871178Z 2025-05-07T20:32:42.1871383Z self = 2025-05-07T20:32:42.1872450Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1873804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664d580>} 2025-05-07T20:32:42.1875250Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1876265Z context = 2025-05-07T20:32:42.1876548Z 2025-05-07T20:32:42.1876713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1877223Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1877691Z module_map=module_map) 2025-05-07T20:32:42.1878054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1878398Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1878651Z E ^ 2025-05-07T20:32:42.1879110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1879559Z 2025-05-07T20:32:42.1879974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1880484Z 2025-05-07T20:32:42.1880586Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1881027Z self=, 2025-05-07T20:32:42.1881427Z T=1, 2025-05-07T20:32:42.1881614Z D=7168, 2025-05-07T20:32:42.1881799Z scale_ub=None, 2025-05-07T20:32:42.1882014Z contiguous=True, 2025-05-07T20:32:42.1882237Z compiled=False, 2025-05-07T20:32:42.1882438Z ) 2025-05-07T20:32:42.1882756Z self = 2025-05-07T20:32:42.1883241Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.1883496Z 2025-05-07T20:32:42.1883578Z @given( 2025-05-07T20:32:42.1883812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.1884127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.1884431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.1884757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.1885085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.1885372Z ) 2025-05-07T20:32:42.1885719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.1886158Z def test_silu_mul_quant( 2025-05-07T20:32:42.1886404Z self, 2025-05-07T20:32:42.1886593Z T: int, 2025-05-07T20:32:42.1886792Z D: int, 2025-05-07T20:32:42.1887010Z scale_ub: Optional[float], 2025-05-07T20:32:42.1887271Z contiguous: bool, 2025-05-07T20:32:42.1887516Z compiled: bool, 2025-05-07T20:32:42.1887741Z ) -> None: 2025-05-07T20:32:42.1887951Z torch.manual_seed(2025) 2025-05-07T20:32:42.1888197Z 2025-05-07T20:32:42.1888469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.1888809Z 2025-05-07T20:32:42.1889001Z x_sign = torch.sign(x) 2025-05-07T20:32:42.1889291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.1889644Z x = x_sign * x_clamp 2025-05-07T20:32:42.1889878Z x0 = x[:, :D] 2025-05-07T20:32:42.1890096Z x1 = x[:, D:] 2025-05-07T20:32:42.1890299Z 2025-05-07T20:32:42.1890482Z if contiguous: 2025-05-07T20:32:42.1890712Z x0 = x0.contiguous() 2025-05-07T20:32:42.1890966Z x1 = x1.contiguous() 2025-05-07T20:32:42.1891194Z 2025-05-07T20:32:42.1891399Z if scale_ub is not None: 2025-05-07T20:32:42.1891675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.1892013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.1892316Z ) 2025-05-07T20:32:42.1892558Z else: 2025-05-07T20:32:42.1892768Z scale_ub_tensor = None 2025-05-07T20:32:42.1893047Z 2025-05-07T20:32:42.1893281Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.1893591Z op = silu_mul_quant 2025-05-07T20:32:42.1893874Z if compiled: 2025-05-07T20:32:42.1894124Z op = torch.compile(op) 2025-05-07T20:32:42.1894413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1894678Z 2025-05-07T20:32:42.1894869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.1895037Z 2025-05-07T20:32:42.1895133Z moe/activation_test.py:117: 2025-05-07T20:32:42.1895433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1895758Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.1896059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.1896774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.1897457Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.1897990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.1898670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.1899322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.1899844Z kernel = self.compile( 2025-05-07T20:32:42.1900376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.1901029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.1901420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.1901649Z 2025-05-07T20:32:42.1901851Z self = 2025-05-07T20:32:42.1902918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.1904284Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664cea0>} 2025-05-07T20:32:42.1905608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.1906668Z context = 2025-05-07T20:32:42.1906958Z 2025-05-07T20:32:42.1907121Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.1907633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.1908103Z module_map=module_map) 2025-05-07T20:32:42.1908468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.1908816Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.1909119Z E ^ 2025-05-07T20:32:42.1909628Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.1910083Z 2025-05-07T20:32:42.1910497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.1911005Z 2025-05-07T20:32:42.1911107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.1911518Z self=, 2025-05-07T20:32:42.1911914Z T=16384, 2025-05-07T20:32:42.1912110Z D=7168, 2025-05-07T20:32:42.1912309Z scale_ub=1200.0, 2025-05-07T20:32:42.1912529Z contiguous=False, 2025-05-07T20:32:42.1912798Z compiled=True, 2025-05-07T20:32:42.5375502Z ) 2025-05-07T20:32:42.5377063Z self = 2025-05-07T20:32:42.5378207Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:42.5378617Z 2025-05-07T20:32:42.5378742Z @given( 2025-05-07T20:32:42.5379074Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5379489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5379900Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5380321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5380677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5380978Z ) 2025-05-07T20:32:42.5381329Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5381782Z def test_silu_mul_quant( 2025-05-07T20:32:42.5382035Z self, 2025-05-07T20:32:42.5382240Z T: int, 2025-05-07T20:32:42.5382458Z D: int, 2025-05-07T20:32:42.5382694Z scale_ub: Optional[float], 2025-05-07T20:32:42.5382967Z contiguous: bool, 2025-05-07T20:32:42.5383218Z compiled: bool, 2025-05-07T20:32:42.5383460Z ) -> None: 2025-05-07T20:32:42.5383681Z torch.manual_seed(2025) 2025-05-07T20:32:42.5383933Z 2025-05-07T20:32:42.5384220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5384575Z 2025-05-07T20:32:42.5384780Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5385085Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5385405Z x = x_sign * x_clamp 2025-05-07T20:32:42.5385644Z x0 = x[:, :D] 2025-05-07T20:32:42.5385867Z x1 = x[:, D:] 2025-05-07T20:32:42.5386085Z 2025-05-07T20:32:42.5386273Z if contiguous: 2025-05-07T20:32:42.5386512Z x0 = x0.contiguous() 2025-05-07T20:32:42.5386779Z x1 = x1.contiguous() 2025-05-07T20:32:42.5387019Z 2025-05-07T20:32:42.5387228Z if scale_ub is not None: 2025-05-07T20:32:42.5387517Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5387859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5388181Z ) 2025-05-07T20:32:42.5388387Z else: 2025-05-07T20:32:42.5388600Z scale_ub_tensor = None 2025-05-07T20:32:42.5388860Z 2025-05-07T20:32:42.5389202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5389520Z op = silu_mul_quant 2025-05-07T20:32:42.5389783Z if compiled: 2025-05-07T20:32:42.5390043Z op = torch.compile(op) 2025-05-07T20:32:42.5390346Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5390620Z 2025-05-07T20:32:42.5390824Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5390989Z 2025-05-07T20:32:42.5391102Z moe/activation_test.py:117: 2025-05-07T20:32:42.5391396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5391748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5392041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5392712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5393286Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5393951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5394640Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5395174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5395888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5396583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5397212Z kernel = self.compile( 2025-05-07T20:32:42.5397837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5398538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5398949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5399182Z 2025-05-07T20:32:42.5399390Z self = 2025-05-07T20:32:42.5400479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5401864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664f9c0>} 2025-05-07T20:32:42.5403212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5404246Z context = 2025-05-07T20:32:42.5404533Z 2025-05-07T20:32:42.5404704Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5405230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5405710Z module_map=module_map) 2025-05-07T20:32:42.5406075Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5406444Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5406760Z E ^ 2025-05-07T20:32:42.5407235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5407688Z 2025-05-07T20:32:42.5408111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5408630Z 2025-05-07T20:32:42.5408738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5409166Z self=, 2025-05-07T20:32:42.5409574Z T=1, 2025-05-07T20:32:42.5409761Z D=7168, 2025-05-07T20:32:42.5409970Z scale_ub=None, 2025-05-07T20:32:42.5410199Z contiguous=False, 2025-05-07T20:32:42.5410428Z compiled=False, 2025-05-07T20:32:42.5410645Z ) 2025-05-07T20:32:42.5410976Z self = 2025-05-07T20:32:42.5411460Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5411730Z 2025-05-07T20:32:42.5411813Z @given( 2025-05-07T20:32:42.5412055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5412374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5412692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5413031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5413366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5413651Z ) 2025-05-07T20:32:42.5414063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5414515Z def test_silu_mul_quant( 2025-05-07T20:32:42.5414760Z self, 2025-05-07T20:32:42.5414966Z T: int, 2025-05-07T20:32:42.5415172Z D: int, 2025-05-07T20:32:42.5415393Z scale_ub: Optional[float], 2025-05-07T20:32:42.5415673Z contiguous: bool, 2025-05-07T20:32:42.5415923Z compiled: bool, 2025-05-07T20:32:42.5416149Z ) -> None: 2025-05-07T20:32:42.5416373Z torch.manual_seed(2025) 2025-05-07T20:32:42.5416629Z 2025-05-07T20:32:42.5416902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5417300Z 2025-05-07T20:32:42.5417548Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5417842Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5418159Z x = x_sign * x_clamp 2025-05-07T20:32:42.5418451Z x0 = x[:, :D] 2025-05-07T20:32:42.5418683Z x1 = x[:, D:] 2025-05-07T20:32:42.5418893Z 2025-05-07T20:32:42.5419093Z if contiguous: 2025-05-07T20:32:42.5419335Z x0 = x0.contiguous() 2025-05-07T20:32:42.5419595Z x1 = x1.contiguous() 2025-05-07T20:32:42.5419842Z 2025-05-07T20:32:42.5420047Z if scale_ub is not None: 2025-05-07T20:32:42.5420320Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5420660Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5420980Z ) 2025-05-07T20:32:42.5421174Z else: 2025-05-07T20:32:42.5421394Z scale_ub_tensor = None 2025-05-07T20:32:42.5421652Z 2025-05-07T20:32:42.5421885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5422210Z op = silu_mul_quant 2025-05-07T20:32:42.5422469Z if compiled: 2025-05-07T20:32:42.5422717Z op = torch.compile(op) 2025-05-07T20:32:42.5423023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5423307Z 2025-05-07T20:32:42.5423507Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5423674Z 2025-05-07T20:32:42.5423796Z moe/activation_test.py:117: 2025-05-07T20:32:42.5424103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5424444Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5424726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5425416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5426115Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5426705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5427394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5428065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5428913Z kernel = self.compile( 2025-05-07T20:32:42.5429495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5430151Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5430558Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5430789Z 2025-05-07T20:32:42.5431006Z self = 2025-05-07T20:32:42.5432090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5433532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5c860>} 2025-05-07T20:32:42.5434880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5435912Z context = 2025-05-07T20:32:42.5436199Z 2025-05-07T20:32:42.5436379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5436940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5437415Z module_map=module_map) 2025-05-07T20:32:42.5437888Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5438305Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5438570Z E ^ 2025-05-07T20:32:42.5439107Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5439562Z 2025-05-07T20:32:42.5439991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5440502Z 2025-05-07T20:32:42.5440609Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5441030Z self=, 2025-05-07T20:32:42.5441444Z T=2048, 2025-05-07T20:32:42.5441648Z D=7168, 2025-05-07T20:32:42.5441845Z scale_ub=None, 2025-05-07T20:32:42.5442072Z contiguous=False, 2025-05-07T20:32:42.5442308Z compiled=True, 2025-05-07T20:32:42.5442517Z ) 2025-05-07T20:32:42.6129002Z self = 2025-05-07T20:32:42.6130433Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.6131175Z 2025-05-07T20:32:42.6131401Z @given( 2025-05-07T20:32:42.6131879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.6132514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.6133121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.6133766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.6134417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.6134981Z ) 2025-05-07T20:32:42.6135663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.6136373Z def test_silu_mul_quant( 2025-05-07T20:32:42.6136654Z self, 2025-05-07T20:32:42.6136851Z T: int, 2025-05-07T20:32:42.6137098Z D: int, 2025-05-07T20:32:42.6137320Z scale_ub: Optional[float], 2025-05-07T20:32:42.6137604Z contiguous: bool, 2025-05-07T20:32:42.6137854Z compiled: bool, 2025-05-07T20:32:42.6138084Z ) -> None: 2025-05-07T20:32:42.6138312Z torch.manual_seed(2025) 2025-05-07T20:32:42.6138565Z 2025-05-07T20:32:42.6138842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.6139190Z 2025-05-07T20:32:42.6139397Z x_sign = torch.sign(x) 2025-05-07T20:32:42.6139687Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.6140002Z x = x_sign * x_clamp 2025-05-07T20:32:42.6140251Z x0 = x[:, :D] 2025-05-07T20:32:42.6140475Z x1 = x[:, D:] 2025-05-07T20:32:42.6140686Z 2025-05-07T20:32:42.6140886Z if contiguous: 2025-05-07T20:32:42.6141129Z x0 = x0.contiguous() 2025-05-07T20:32:42.6141389Z x1 = x1.contiguous() 2025-05-07T20:32:42.6141638Z 2025-05-07T20:32:42.6149525Z if scale_ub is not None: 2025-05-07T20:32:42.6149841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.6150185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.6150505Z ) 2025-05-07T20:32:42.6150714Z else: 2025-05-07T20:32:42.6150929Z scale_ub_tensor = None 2025-05-07T20:32:42.6151194Z 2025-05-07T20:32:42.6151728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.6152061Z op = silu_mul_quant 2025-05-07T20:32:42.6152315Z if compiled: 2025-05-07T20:32:42.6152578Z op = torch.compile(op) 2025-05-07T20:32:42.6152886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6153163Z 2025-05-07T20:32:42.6153370Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.6153537Z 2025-05-07T20:32:42.6153655Z moe/activation_test.py:117: 2025-05-07T20:32:42.6153952Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6154299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.6154682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.6155342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.6155961Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.6156713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.6157406Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.6157940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.6158625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.6159289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.6159827Z kernel = self.compile( 2025-05-07T20:32:42.6160368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.6161037Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6161446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.6161680Z 2025-05-07T20:32:42.6161890Z self = 2025-05-07T20:32:42.6162980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.6164374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5dbc0>} 2025-05-07T20:32:42.6165727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.6166764Z context = 2025-05-07T20:32:42.6167053Z 2025-05-07T20:32:42.6167225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.6167755Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6168232Z module_map=module_map) 2025-05-07T20:32:42.6168606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6168958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6169228Z E ^ 2025-05-07T20:32:42.6169701Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(
    self,
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    torch.manual_seed(2025)

    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

    x_sign = torch.sign(x)
    x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
    x = x_sign * x_clamp
    x0 = x[:, :D]
    x1 = x[:, D:]

    if contiguous:
        x0 = x0.contiguous()
        x1 = x1.contiguous()

    if scale_ub is not None:
        scale_ub_tensor = torch.tensor(
            [scale_ub], device="cuda", dtype=torch.float32
        )
    else:
        scale_ub_tensor = None

    def fn() -> Tuple[torch.Tensor, torch.Tensor]:
        op = silu_mul_quant
        if compiled:
            op = torch.compile(op)
        return op(x0, x1, scale_ub_tensor)

>   y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9655c5e700>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
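Every one of these examples fails at kernel-compile time, before the input data matters: Triton aborts while lowering the fp8e4nv element type in _fbgemm_silu_mul_quant. fp8e4nv is Triton's name for the float8_e4m3fn format; per the error text it is unavailable on this GPU architecture (the A10G on a g5 instance is sm_86, and Triton emits fp8e4nv only for compute capability 8.9 and newer, i.e. Ada and Hopper parts). A minimal capability guard, sketched under that assumption; this is hypothetical, not FBGEMM's actual dispatch logic:

    import torch

    # Hypothetical guard, not FBGEMM's dispatch code: Triton's fp8e4nv
    # (torch.float8_e4m3fn) is assumed here to require an NVIDIA GPU with
    # compute capability >= (8, 9). The A10G driving this job reports (8, 6),
    # which matches the CompilationError above.
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if not supports_fp8e4nv():
        print("fp8e4nv unsupported on this GPU; skip or fall back to bf16")

A guard like this could route the op to a bf16 fallback, or let the test suite skip its fp8 variants on pre-sm_89 runners instead of failing every Hypothesis example.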
Hypothesis went on to try further parameter combinations. Each produced the same source listing and traceback as above and ended in the identical CompilationError at triton/compiler/compiler.py:100 (for compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent from the traceback):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
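For orientation while scanning these repeats: silu_mul_quant fuses SwiGLU-style gating, silu(x0) * x1, with quantization to fp8. A rough eager-mode sketch of those semantics follows; it assumes row-wise scaling and a float8_e4m3fn output, which may differ from what the FBGEMM Triton kernel actually does, and is only a reference for reading the test:

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    # Eager-mode reference sketch (assumed semantics, not the FBGEMM kernel).
    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gating, computed in fp32 for accuracy.
        y = F.silu(x0.float()) * x1.float()
        # Per-row absolute max; scale_ub, if given, caps it (mirroring the
        # scale_ub_tensor argument the test passes).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.float() * y_scale recovers y up to fp8 rounding, which is how a test like this would typically compare against a bf16 reference.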
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2681496Z 2025-05-07T20:32:43.2681913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2682425Z 2025-05-07T20:32:43.2682532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2682946Z self=, 2025-05-07T20:32:43.2683348Z T=128, 2025-05-07T20:32:43.2683533Z D=5120, 2025-05-07T20:32:43.2683729Z scale_ub=1200.0, 2025-05-07T20:32:43.2683956Z contiguous=False, 2025-05-07T20:32:43.2684177Z compiled=True, 2025-05-07T20:32:43.2684387Z ) 2025-05-07T20:32:43.5168281Z self = 2025-05-07T20:32:43.5169019Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.5169397Z 2025-05-07T20:32:43.5169527Z @given( 2025-05-07T20:32:43.5169909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5170247Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5170565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5170904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5171237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5171531Z ) 2025-05-07T20:32:43.5171894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5172334Z def test_silu_mul_quant( 2025-05-07T20:32:43.5172586Z self, 2025-05-07T20:32:43.5172795Z T: int, 2025-05-07T20:32:43.5173007Z D: int, 2025-05-07T20:32:43.5173241Z scale_ub: Optional[float], 2025-05-07T20:32:43.5173515Z contiguous: bool, 2025-05-07T20:32:43.5173757Z compiled: bool, 2025-05-07T20:32:43.5173993Z ) -> None: 2025-05-07T20:32:43.5174216Z torch.manual_seed(2025) 2025-05-07T20:32:43.5174461Z 2025-05-07T20:32:43.5174732Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5175075Z 2025-05-07T20:32:43.5175285Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5175577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5175893Z x = x_sign * x_clamp 2025-05-07T20:32:43.5176164Z x0 = x[:, :D] 2025-05-07T20:32:43.5176412Z x1 = x[:, D:] 2025-05-07T20:32:43.5176628Z 2025-05-07T20:32:43.5176818Z if contiguous: 2025-05-07T20:32:43.5177047Z x0 = x0.contiguous() 2025-05-07T20:32:43.5177309Z x1 = x1.contiguous() 2025-05-07T20:32:43.5177554Z 2025-05-07T20:32:43.5177748Z if scale_ub is not None: 2025-05-07T20:32:43.5181116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5181459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5181765Z ) 2025-05-07T20:32:43.5181965Z else: 2025-05-07T20:32:43.5182272Z scale_ub_tensor = None 2025-05-07T20:32:43.5182520Z 2025-05-07T20:32:43.5182757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5183072Z op = silu_mul_quant 2025-05-07T20:32:43.5183318Z if compiled: 2025-05-07T20:32:43.5183568Z op = torch.compile(op) 2025-05-07T20:32:43.5183872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5184140Z 2025-05-07T20:32:43.5184342Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5184505Z 2025-05-07T20:32:43.5184616Z moe/activation_test.py:117: 2025-05-07T20:32:43.5184917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5185343Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5185656Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5186221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5187021Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5187840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5188576Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5189210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5189887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5190549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5191073Z kernel = self.compile( 2025-05-07T20:32:43.5191618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5192283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5192684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5192924Z 2025-05-07T20:32:43.5193136Z self = 2025-05-07T20:32:43.5194212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5195577Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655816340>} 2025-05-07T20:32:43.5196922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5197992Z context = 2025-05-07T20:32:43.5198285Z 2025-05-07T20:32:43.5198452Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5198971Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5199437Z module_map=module_map) 2025-05-07T20:32:43.5199799Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5200153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5200415Z E ^ 2025-05-07T20:32:43.5200872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5201323Z 2025-05-07T20:32:43.5201737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5202355Z 2025-05-07T20:32:43.5202464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5202880Z self=, 2025-05-07T20:32:43.5203353Z T=16384, 2025-05-07T20:32:43.5203560Z D=7168, 2025-05-07T20:32:43.5203761Z scale_ub=1200.0, 2025-05-07T20:32:43.5203985Z contiguous=True, 2025-05-07T20:32:43.5204216Z compiled=True, 2025-05-07T20:32:43.5204430Z ) 2025-05-07T20:32:43.5204747Z self = 2025-05-07T20:32:43.5205240Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.5205525Z 2025-05-07T20:32:43.5205606Z @given( 2025-05-07T20:32:43.5205845Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5206183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5206519Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5206897Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5207219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5207504Z ) 2025-05-07T20:32:43.5207895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5208331Z def test_silu_mul_quant( 2025-05-07T20:32:43.5208573Z self, 2025-05-07T20:32:43.5208774Z T: int, 2025-05-07T20:32:43.5208969Z D: int, 2025-05-07T20:32:43.5209195Z scale_ub: Optional[float], 2025-05-07T20:32:43.5209475Z contiguous: bool, 2025-05-07T20:32:43.5209717Z compiled: bool, 2025-05-07T20:32:43.5209941Z ) -> None: 2025-05-07T20:32:43.5210166Z torch.manual_seed(2025) 2025-05-07T20:32:43.5210411Z 2025-05-07T20:32:43.5217393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5217868Z 2025-05-07T20:32:43.5218131Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5218441Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5218760Z x = x_sign * x_clamp 2025-05-07T20:32:43.5219014Z x0 = x[:, :D] 2025-05-07T20:32:43.5219251Z x1 = x[:, D:] 2025-05-07T20:32:43.5219466Z 2025-05-07T20:32:43.5219674Z if contiguous: 2025-05-07T20:32:43.5219922Z x0 = x0.contiguous() 2025-05-07T20:32:43.5220184Z x1 = x1.contiguous() 2025-05-07T20:32:43.5220434Z 2025-05-07T20:32:43.5220639Z if scale_ub is not None: 2025-05-07T20:32:43.5220917Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5221251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5221560Z ) 2025-05-07T20:32:43.5221771Z else: 2025-05-07T20:32:43.5221987Z scale_ub_tensor = None 2025-05-07T20:32:43.5222254Z 2025-05-07T20:32:43.5222499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5222819Z op = silu_mul_quant 2025-05-07T20:32:43.5223096Z if compiled: 2025-05-07T20:32:43.5223356Z op = torch.compile(op) 2025-05-07T20:32:43.5223661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5223952Z 2025-05-07T20:32:43.5224160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5224333Z 2025-05-07T20:32:43.5224439Z moe/activation_test.py:117: 2025-05-07T20:32:43.5224746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5225090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5225376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5225962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5226536Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.5227258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5227956Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5228891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5229707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5230373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5230900Z kernel = self.compile( 2025-05-07T20:32:43.5231451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5232113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5232513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5232751Z 2025-05-07T20:32:43.5232960Z self = 2025-05-07T20:32:43.5234046Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5235555Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655817c40>} 2025-05-07T20:32:43.5236902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5237919Z context = 2025-05-07T20:32:43.5238219Z 2025-05-07T20:32:43.5238388Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5238912Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5239397Z module_map=module_map) 2025-05-07T20:32:43.5239768Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5240137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5240413Z E ^ 2025-05-07T20:32:43.5240886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5241343Z 2025-05-07T20:32:43.5241765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5242286Z 2025-05-07T20:32:43.6203635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6204337Z self=, 2025-05-07T20:32:43.6204913Z T=16384, 2025-05-07T20:32:43.6205213Z D=5120, 2025-05-07T20:32:43.6205424Z scale_ub=1200.0, 2025-05-07T20:32:43.6205668Z contiguous=True, 2025-05-07T20:32:43.6205945Z compiled=False, 2025-05-07T20:32:43.6206180Z ) 2025-05-07T20:32:43.6206546Z self = 2025-05-07T20:32:43.6207140Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.6207479Z 2025-05-07T20:32:43.6207566Z @given( 2025-05-07T20:32:43.6207827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6208187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6208541Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6208875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6209218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6209509Z ) 2025-05-07T20:32:43.6209861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6210306Z def test_silu_mul_quant( 2025-05-07T20:32:43.6210558Z self, 2025-05-07T20:32:43.6210758Z T: int, 2025-05-07T20:32:43.6210972Z D: int, 2025-05-07T20:32:43.6211484Z scale_ub: Optional[float], 2025-05-07T20:32:43.6211758Z contiguous: bool, 2025-05-07T20:32:43.6212010Z compiled: bool, 2025-05-07T20:32:43.6212251Z ) -> None: 2025-05-07T20:32:43.6212604Z torch.manual_seed(2025) 2025-05-07T20:32:43.6212861Z 2025-05-07T20:32:43.6213143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6213493Z 2025-05-07T20:32:43.6213693Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6213998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6214317Z x = x_sign * x_clamp 2025-05-07T20:32:43.6214557Z x0 = x[:, :D] 2025-05-07T20:32:43.6214786Z x1 = x[:, D:] 2025-05-07T20:32:43.6215003Z 2025-05-07T20:32:43.6215194Z if contiguous: 2025-05-07T20:32:43.6215437Z x0 = x0.contiguous() 2025-05-07T20:32:43.6215707Z x1 = x1.contiguous() 2025-05-07T20:32:43.6216037Z 2025-05-07T20:32:43.6216255Z if scale_ub is not None: 2025-05-07T20:32:43.6216541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6216958Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6217281Z ) 2025-05-07T20:32:43.6217492Z else: 2025-05-07T20:32:43.6217712Z scale_ub_tensor = None 2025-05-07T20:32:43.6217970Z 2025-05-07T20:32:43.6218219Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6218538Z op = silu_mul_quant 2025-05-07T20:32:43.6218804Z if compiled: 2025-05-07T20:32:43.6219070Z op = torch.compile(op) 2025-05-07T20:32:43.6219373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6219646Z 2025-05-07T20:32:43.6219853Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6220020Z 2025-05-07T20:32:43.6220135Z moe/activation_test.py:117: 2025-05-07T20:32:43.6220433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6220771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6221057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6221756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.6222449Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6222985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6223675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6224343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6224871Z kernel = self.compile( 2025-05-07T20:32:43.6225417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6226079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6226477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6226714Z 2025-05-07T20:32:43.6226924Z self = 2025-05-07T20:32:43.6228000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6229707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618ae0>} 2025-05-07T20:32:43.6231040Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6232144Z context = 2025-05-07T20:32:43.6232435Z 2025-05-07T20:32:43.6232603Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6233187Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6233659Z module_map=module_map) 2025-05-07T20:32:43.6234021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6234379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6234644Z E ^ 2025-05-07T20:32:43.6235108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6235561Z 2025-05-07T20:32:43.6235976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6236496Z 2025-05-07T20:32:43.6236670Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.6237091Z self=, 2025-05-07T20:32:43.6237490Z T=1, 2025-05-07T20:32:43.6237686Z D=7168, 2025-05-07T20:32:43.6237949Z scale_ub=1200.0, 2025-05-07T20:32:43.6238183Z contiguous=False, 2025-05-07T20:32:43.6238416Z compiled=False, 2025-05-07T20:32:43.6238629Z ) 2025-05-07T20:32:43.6238946Z self = 2025-05-07T20:32:43.6239436Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.6239700Z 2025-05-07T20:32:43.6239791Z @given( 2025-05-07T20:32:43.6240027Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.6240346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.6240658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.6240992Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.6241322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.6241614Z ) 2025-05-07T20:32:43.6241969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.6242417Z def test_silu_mul_quant( 2025-05-07T20:32:43.6242676Z self, 2025-05-07T20:32:43.6243446Z T: int, 2025-05-07T20:32:43.6243732Z D: int, 2025-05-07T20:32:43.6243980Z scale_ub: Optional[float], 2025-05-07T20:32:43.6244286Z contiguous: bool, 2025-05-07T20:32:43.6244594Z compiled: bool, 2025-05-07T20:32:43.6244826Z ) -> None: 2025-05-07T20:32:43.6245053Z torch.manual_seed(2025) 2025-05-07T20:32:43.6245295Z 2025-05-07T20:32:43.6245579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.6245924Z 2025-05-07T20:32:43.6246130Z x_sign = torch.sign(x) 2025-05-07T20:32:43.6246418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.6246730Z x = x_sign * x_clamp 2025-05-07T20:32:43.6246980Z x0 = x[:, :D] 2025-05-07T20:32:43.6247195Z x1 = x[:, D:] 2025-05-07T20:32:43.6247403Z 2025-05-07T20:32:43.6247603Z if contiguous: 2025-05-07T20:32:43.6247842Z x0 = x0.contiguous() 2025-05-07T20:32:43.6248101Z x1 = x1.contiguous() 2025-05-07T20:32:43.6248346Z 2025-05-07T20:32:43.6248545Z if scale_ub is not None: 2025-05-07T20:32:43.6248817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.6249158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.6249471Z ) 2025-05-07T20:32:43.6249663Z else: 2025-05-07T20:32:43.6249880Z scale_ub_tensor = None 2025-05-07T20:32:43.6250135Z 2025-05-07T20:32:43.6250405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.6250739Z op = silu_mul_quant 2025-05-07T20:32:43.6251001Z if compiled: 2025-05-07T20:32:43.6251247Z op = torch.compile(op) 2025-05-07T20:32:43.6251634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6251912Z 2025-05-07T20:32:43.6252109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.6252272Z 2025-05-07T20:32:43.6252376Z moe/activation_test.py:117: 2025-05-07T20:32:43.6252718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6253054Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.6253330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.6254016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.6254701Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.6255239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.6255911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.6256681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.6257219Z kernel = self.compile( 2025-05-07T20:32:43.6257797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.6258455Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.6258855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.6259080Z 2025-05-07T20:32:43.6259292Z self = 2025-05-07T20:32:43.6260361Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.6261724Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618400>} 2025-05-07T20:32:43.6263064Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.6264081Z context = 2025-05-07T20:32:43.6264365Z 2025-05-07T20:32:43.6264536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.6265045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.6265518Z module_map=module_map) 2025-05-07T20:32:43.6265886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.6266233Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.6266499Z E ^ 2025-05-07T20:32:43.6266973Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.6267425Z 2025-05-07T20:32:43.6267854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.6268362Z 2025-05-07T20:32:43.7609391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7610577Z self=, 2025-05-07T20:32:43.7611391Z T=4096, 2025-05-07T20:32:43.7611772Z D=7168, 2025-05-07T20:32:43.7612156Z scale_ub=1200.0, 2025-05-07T20:32:43.7612614Z contiguous=False, 2025-05-07T20:32:43.7613068Z compiled=True, 2025-05-07T20:32:43.7613471Z ) 2025-05-07T20:32:43.7614105Z self = 2025-05-07T20:32:43.7615082Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.7615624Z 2025-05-07T20:32:43.7615816Z @given( 2025-05-07T20:32:43.7616447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7616767Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7617091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7617514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7617845Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7618133Z ) 2025-05-07T20:32:43.7618478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7618922Z def test_silu_mul_quant( 2025-05-07T20:32:43.7619172Z self, 2025-05-07T20:32:43.7619374Z T: int, 2025-05-07T20:32:43.7619570Z D: int, 2025-05-07T20:32:43.7619795Z scale_ub: Optional[float], 2025-05-07T20:32:43.7620068Z contiguous: bool, 2025-05-07T20:32:43.7620302Z compiled: bool, 2025-05-07T20:32:43.7620530Z ) -> None: 2025-05-07T20:32:43.7620750Z torch.manual_seed(2025) 2025-05-07T20:32:43.7621082Z 2025-05-07T20:32:43.7621360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7621704Z 2025-05-07T20:32:43.7621897Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7622283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7622597Z x = x_sign * x_clamp 2025-05-07T20:32:43.7622833Z x0 = x[:, :D] 2025-05-07T20:32:43.7623055Z x1 = x[:, D:] 2025-05-07T20:32:43.7623266Z 2025-05-07T20:32:43.7623450Z if contiguous: 2025-05-07T20:32:43.7623683Z x0 = x0.contiguous() 2025-05-07T20:32:43.7623945Z x1 = x1.contiguous() 2025-05-07T20:32:43.7624184Z 2025-05-07T20:32:43.7624386Z if scale_ub is not None: 2025-05-07T20:32:43.7624667Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7625002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7625305Z ) 2025-05-07T20:32:43.7625503Z else: 2025-05-07T20:32:43.7625723Z scale_ub_tensor = None 2025-05-07T20:32:43.7625972Z 2025-05-07T20:32:43.7626206Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7626523Z op = silu_mul_quant 2025-05-07T20:32:43.7626775Z if compiled: 2025-05-07T20:32:43.7627030Z op = torch.compile(op) 2025-05-07T20:32:43.7627330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7627600Z 2025-05-07T20:32:43.7627808Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7627974Z 2025-05-07T20:32:43.7628082Z moe/activation_test.py:117: 2025-05-07T20:32:43.7628649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7628989Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7629348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7629911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.7630470Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.7631129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7631819Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7632354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7633035Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7633698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7634231Z kernel = self.compile( 2025-05-07T20:32:43.7634767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7635419Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7635819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7636177Z 2025-05-07T20:32:43.7636391Z self = 2025-05-07T20:32:43.7637523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7638896Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965561af20>} 2025-05-07T20:32:43.7640225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7641242Z context = 2025-05-07T20:32:43.7641589Z 2025-05-07T20:32:43.7641766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7642281Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7642816Z module_map=module_map) 2025-05-07T20:32:43.7643184Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7643530Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7643794Z E ^ 2025-05-07T20:32:43.7644258Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7644707Z 2025-05-07T20:32:43.7645128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7645636Z 2025-05-07T20:32:43.7645744Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7646158Z self=, 2025-05-07T20:32:43.7646567Z T=128, 2025-05-07T20:32:43.7646755Z D=7168, 2025-05-07T20:32:43.7646951Z scale_ub=1200.0, 2025-05-07T20:32:43.7647183Z contiguous=False, 2025-05-07T20:32:43.7647411Z compiled=True, 2025-05-07T20:32:43.7647617Z ) 2025-05-07T20:32:43.8366008Z self = 2025-05-07T20:32:43.8366774Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.8367055Z 2025-05-07T20:32:43.8367151Z @given( 2025-05-07T20:32:43.8367386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8367714Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8368034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8368364Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8368697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8368989Z ) 2025-05-07T20:32:43.8369339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8369803Z def test_silu_mul_quant( 2025-05-07T20:32:43.8370055Z self, 2025-05-07T20:32:43.8370256Z T: int, 2025-05-07T20:32:43.8370473Z D: int, 2025-05-07T20:32:43.8370707Z scale_ub: Optional[float], 2025-05-07T20:32:43.8370987Z contiguous: bool, 2025-05-07T20:32:43.8371229Z compiled: bool, 2025-05-07T20:32:43.8371467Z ) -> None: 2025-05-07T20:32:43.8371692Z torch.manual_seed(2025) 2025-05-07T20:32:43.8371936Z 2025-05-07T20:32:43.8372216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8372563Z 2025-05-07T20:32:43.8372762Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8373059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8373372Z x = x_sign * x_clamp 2025-05-07T20:32:43.8373616Z x0 = x[:, :D] 2025-05-07T20:32:43.8373850Z x1 = x[:, D:] 2025-05-07T20:32:43.8374071Z 2025-05-07T20:32:43.8374488Z if contiguous: 2025-05-07T20:32:43.8374734Z x0 = x0.contiguous() 2025-05-07T20:32:43.8374998Z x1 = x1.contiguous() 2025-05-07T20:32:43.8375236Z 2025-05-07T20:32:43.8375526Z if scale_ub is not None: 2025-05-07T20:32:43.8375808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8376140Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8376503Z ) 2025-05-07T20:32:43.8376707Z else: 2025-05-07T20:32:43.8376923Z scale_ub_tensor = None 2025-05-07T20:32:43.8377173Z 2025-05-07T20:32:43.8377414Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8377730Z op = silu_mul_quant 2025-05-07T20:32:43.8377983Z if compiled: 2025-05-07T20:32:43.8378254Z op = torch.compile(op) 2025-05-07T20:32:43.8385462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8385887Z 2025-05-07T20:32:43.8386093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8386273Z 2025-05-07T20:32:43.8386380Z moe/activation_test.py:117: 2025-05-07T20:32:43.8386813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8387187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8387474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8388046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8388622Z return fn(*args, **kwargs) 
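For readers following the traceback, the op under test fuses a SwiGLU-style activation with FP8 quantization: the names silu_mul_quant, y_fp8, y_scale, and scale_ub in the test imply silu(x0) * x1 followed by a scaled cast to float8. A hedged eager-mode sketch of those semantics, assuming rowwise e4m3 quantization with an optional scale upper bound (the fused kernel's exact scaling convention is not shown in this log):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute silu(x0) * x1 in float32, then quantize each row to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Clamp the per-row amax to the caller-provided upper bound.
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale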
2025-05-07T20:32:43.8389359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8390061Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8390605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8391294Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8391963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8392518Z kernel = self.compile( 2025-05-07T20:32:43.8393074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8393737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8394146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8394383Z 2025-05-07T20:32:43.8394593Z self = 2025-05-07T20:32:43.8395691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8397103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14220>} 2025-05-07T20:32:43.8398460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8399503Z context = 2025-05-07T20:32:43.8399798Z 2025-05-07T20:32:43.8399968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8400497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8400969Z module_map=module_map) 2025-05-07T20:32:43.8401344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8401714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8401981Z E ^ 2025-05-07T20:32:43.8402528Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8402993Z 2025-05-07T20:32:43.8403463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8403980Z 2025-05-07T20:32:43.8404096Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8404518Z self=, 2025-05-07T20:32:43.8404937Z T=2048, 2025-05-07T20:32:43.8405147Z D=7168, 2025-05-07T20:32:43.8405355Z scale_ub=None, 2025-05-07T20:32:43.8405585Z contiguous=True, 2025-05-07T20:32:43.8405828Z compiled=True, 2025-05-07T20:32:43.8406054Z ) 2025-05-07T20:32:43.8406423Z self = 2025-05-07T20:32:43.8406951Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.8407268Z 2025-05-07T20:32:43.8407365Z @given( 2025-05-07T20:32:43.8407610Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8407937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8408296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8408627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8408969Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8409266Z ) 2025-05-07T20:32:43.8409629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8410097Z def test_silu_mul_quant( 2025-05-07T20:32:43.8410354Z self, 2025-05-07T20:32:43.8410556Z T: int, 2025-05-07T20:32:43.8410774Z D: int, 2025-05-07T20:32:43.8410997Z scale_ub: Optional[float], 2025-05-07T20:32:43.8411274Z contiguous: bool, 2025-05-07T20:32:43.8411516Z compiled: bool, 2025-05-07T20:32:43.8411737Z ) -> None: 2025-05-07T20:32:43.8411964Z torch.manual_seed(2025) 2025-05-07T20:32:43.8412211Z 2025-05-07T20:32:43.8412502Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8412856Z 2025-05-07T20:32:43.8413064Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8413368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8413694Z x = x_sign * x_clamp 2025-05-07T20:32:43.8413939Z x0 = x[:, :D] 2025-05-07T20:32:43.8414169Z x1 = x[:, D:] 2025-05-07T20:32:43.8414393Z 2025-05-07T20:32:43.8414587Z if contiguous: 2025-05-07T20:32:43.8414832Z x0 = x0.contiguous() 2025-05-07T20:32:43.8415107Z x1 = x1.contiguous() 2025-05-07T20:32:43.8415348Z 2025-05-07T20:32:43.8415551Z if scale_ub is not None: 2025-05-07T20:32:43.8415842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8416183Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8416522Z ) 2025-05-07T20:32:43.8416772Z else: 2025-05-07T20:32:43.8416999Z scale_ub_tensor = None 2025-05-07T20:32:43.8417256Z 2025-05-07T20:32:43.8417511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8417843Z op = silu_mul_quant 2025-05-07T20:32:43.8418109Z if compiled: 2025-05-07T20:32:43.8418376Z op = torch.compile(op) 2025-05-07T20:32:43.8418685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8418963Z 2025-05-07T20:32:43.8419171Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8419342Z 2025-05-07T20:32:43.8419454Z moe/activation_test.py:117: 2025-05-07T20:32:43.8419756Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8420113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8420412Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8420995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8421624Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.8422311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8423056Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8423603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8424307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8424984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8425534Z kernel = self.compile( 2025-05-07T20:32:43.8426087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8426770Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8427234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8427472Z 2025-05-07T20:32:43.8427767Z self = 2025-05-07T20:32:43.8429193Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8430587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14d60>} 2025-05-07T20:32:43.8431957Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8433002Z context = 2025-05-07T20:32:43.8433301Z 2025-05-07T20:32:43.8433473Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8434015Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8434498Z module_map=module_map) 2025-05-07T20:32:43.8434872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8435239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8435515Z E ^ 2025-05-07T20:32:43.8435994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.8436500Z 2025-05-07T20:32:43.8436923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8437452Z 2025-05-07T20:32:43.9089239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9089822Z self=, 2025-05-07T20:32:43.9090356Z T=16384, 2025-05-07T20:32:43.9090562Z D=5120, 2025-05-07T20:32:43.9090755Z scale_ub=None, 2025-05-07T20:32:43.9090986Z contiguous=False, 2025-05-07T20:32:43.9091226Z compiled=False, 2025-05-07T20:32:43.9091452Z ) 2025-05-07T20:32:43.9091775Z self = 2025-05-07T20:32:43.9092274Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9092550Z 2025-05-07T20:32:43.9092635Z @given( 2025-05-07T20:32:43.9092863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9093193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9093502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9093830Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9094164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9094453Z ) 2025-05-07T20:32:43.9094991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9095440Z def test_silu_mul_quant( 2025-05-07T20:32:43.9095691Z self, 2025-05-07T20:32:43.9095963Z T: int, 2025-05-07T20:32:43.9096160Z D: int, 2025-05-07T20:32:43.9096422Z scale_ub: Optional[float], 2025-05-07T20:32:43.9096708Z contiguous: bool, 2025-05-07T20:32:43.9096957Z compiled: bool, 2025-05-07T20:32:43.9097184Z ) -> None: 2025-05-07T20:32:43.9097403Z torch.manual_seed(2025) 2025-05-07T20:32:43.9097640Z 2025-05-07T20:32:43.9097920Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9098266Z 2025-05-07T20:32:43.9098459Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9098751Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9100872Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9102828Z 2025-05-07T20:32:43.9102952Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9103164Z 2025-05-07T20:32:43.9103277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9103682Z self=, 2025-05-07T20:32:43.9104101Z T=4096, 2025-05-07T20:32:43.9104367Z D=7168, 2025-05-07T20:32:43.9104607Z scale_ub=1200.0, 2025-05-07T20:32:43.9104840Z contiguous=True, 2025-05-07T20:32:43.9105070Z compiled=True, 2025-05-07T20:32:43.9105277Z ) 2025-05-07T20:32:43.9105601Z self = 2025-05-07T20:32:43.9106110Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9106399Z 2025-05-07T20:32:43.9106498Z @given( 2025-05-07T20:32:43.9106753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9107076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9107389Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9107718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9108055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9108349Z ) 2025-05-07T20:32:43.9108698Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9109252Z def test_silu_mul_quant( 2025-05-07T20:32:43.9109499Z self, 2025-05-07T20:32:43.9109696Z T: int, 2025-05-07T20:32:43.9109901Z D: int, 2025-05-07T20:32:43.9110131Z scale_ub: Optional[float], 2025-05-07T20:32:43.9110409Z contiguous: bool, 2025-05-07T20:32:43.9110643Z compiled: bool, 2025-05-07T20:32:43.9110870Z ) -> None: 2025-05-07T20:32:43.9111094Z torch.manual_seed(2025) 2025-05-07T20:32:43.9111331Z 2025-05-07T20:32:43.9111619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9111962Z 2025-05-07T20:32:43.9112165Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9112459Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9114469Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9116485Z 2025-05-07T20:32:43.9116645Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9116862Z 2025-05-07T20:32:43.9116974Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9117388Z self=, 2025-05-07T20:32:43.9117786Z T=16384, 2025-05-07T20:32:43.9117995Z D=7168, 2025-05-07T20:32:43.9118190Z scale_ub=None, 2025-05-07T20:32:43.9118407Z contiguous=False, 2025-05-07T20:32:43.9118640Z compiled=False, 2025-05-07T20:32:43.9118850Z ) 2025-05-07T20:32:43.9119163Z self = 2025-05-07T20:32:43.9119663Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.9119987Z 2025-05-07T20:32:43.9120075Z @given( 2025-05-07T20:32:43.9120303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9120624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9120974Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9121317Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9121643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9121930Z ) 2025-05-07T20:32:43.9122281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9122716Z def test_silu_mul_quant( 2025-05-07T20:32:43.9122963Z self, 2025-05-07T20:32:43.9123166Z T: int, 2025-05-07T20:32:43.9123360Z D: int, 2025-05-07T20:32:43.9123582Z scale_ub: Optional[float], 2025-05-07T20:32:43.9123855Z contiguous: bool, 2025-05-07T20:32:43.9124093Z compiled: bool, 2025-05-07T20:32:43.9124316Z ) -> None: 2025-05-07T20:32:43.9124536Z torch.manual_seed(2025) 2025-05-07T20:32:43.9124783Z 2025-05-07T20:32:43.9125057Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9127168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9129425Z 2025-05-07T20:32:43.9129554Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.9129769Z 2025-05-07T20:32:43.9129883Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9130289Z self=, 2025-05-07T20:32:43.9130710Z T=2048, 2025-05-07T20:32:43.9130903Z D=7168, 2025-05-07T20:32:43.9131093Z scale_ub=1200.0, 2025-05-07T20:32:43.9131332Z contiguous=True, 2025-05-07T20:32:43.9131558Z compiled=True, 2025-05-07T20:32:43.9131764Z ) 2025-05-07T20:32:43.9132094Z self = 2025-05-07T20:32:43.9132598Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.9132869Z 2025-05-07T20:32:43.9132959Z @given( 2025-05-07T20:32:43.9133185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.9133501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.9133815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.9134142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.9134482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.9134774Z ) 2025-05-07T20:32:43.9135124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.9135673Z def test_silu_mul_quant( 2025-05-07T20:32:43.9135923Z self, 2025-05-07T20:32:43.9136121Z T: int, 2025-05-07T20:32:43.9136390Z D: int, 2025-05-07T20:32:43.9136613Z scale_ub: Optional[float], 2025-05-07T20:32:43.9136889Z contiguous: bool, 2025-05-07T20:32:43.9137131Z compiled: bool, 2025-05-07T20:32:43.9137388Z ) -> None: 2025-05-07T20:32:43.9137632Z torch.manual_seed(2025) 2025-05-07T20:32:43.9137874Z 2025-05-07T20:32:43.9138155Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.9138514Z 2025-05-07T20:32:43.9138708Z x_sign = torch.sign(x) 2025-05-07T20:32:43.9139019Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.9141094Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
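The "Tried to allocate" figures in these OutOfMemoryError messages are exactly the size of the bf16 input x of shape [T, 2 * D] at two bytes per element; a quick check reproduces every number in the log:

    # Each failing allocation is one [T, 2*D] bf16 tensor (2 bytes/element).
    for T, D in [(16384, 7168), (16384, 5120), (4096, 7168), (2048, 7168)]:
        mib = T * 2 * D * 2 / 2**20
        print(f"T={T:>5} D={D}: {mib:.2f} MiB")
    # T=16384 D=7168: 448.00 MiB
    # T=16384 D=5120: 320.00 MiB
    # T= 4096 D=7168: 112.00 MiB
    # T= 2048 D=7168: 56.00 MiB

Note also that x_sign, x_clamp, and x_sign * x_clamp each materialize another tensor of the same size, which is why, when the initial torch.randn succeeds (activation_test.py:92), the failure often moves to lines 94 or 95 instead.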
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.9143015Z 2025-05-07T20:32:43.9143135Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.9143348Z 2025-05-07T20:32:43.9143464Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.9143871Z self=, 2025-05-07T20:32:43.9144279Z T=2048, 2025-05-07T20:32:43.9144469Z D=7168, 2025-05-07T20:32:43.9144657Z scale_ub=None, 2025-05-07T20:32:43.9144874Z contiguous=True, 2025-05-07T20:32:43.9145103Z compiled=False, 2025-05-07T20:32:43.9145310Z ) 2025-05-07T20:32:44.1690670Z self = 2025-05-07T20:32:44.1691401Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.1691790Z 2025-05-07T20:32:44.1691939Z @given( 2025-05-07T20:32:44.1692237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1692639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1692948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1693276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1693600Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1693889Z ) 2025-05-07T20:32:44.1694242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1694675Z def test_silu_mul_quant( 2025-05-07T20:32:44.1694920Z self, 2025-05-07T20:32:44.1695119Z T: int, 2025-05-07T20:32:44.1695312Z D: int, 2025-05-07T20:32:44.1695549Z scale_ub: Optional[float], 2025-05-07T20:32:44.1695838Z contiguous: bool, 2025-05-07T20:32:44.1696089Z compiled: bool, 2025-05-07T20:32:44.1696314Z ) -> None: 2025-05-07T20:32:44.1696534Z torch.manual_seed(2025) 2025-05-07T20:32:44.1696780Z 2025-05-07T20:32:44.1697051Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1697393Z 2025-05-07T20:32:44.1697592Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.1699524Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
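Note how the reported free memory shrinks across successive Hypothesis examples (140.44 MiB, then 28.44 MiB, and later 26.44 MiB) while "allocated by PyTorch" stays pinned near 21.6-21.7 GiB: tensors from earlier failed examples remain live, most likely held by the captured tracebacks, so even modest later allocations fail. A defensive cleanup sketch, assuming a pytest fixture is acceptable in this suite (the fixture name is illustrative):

    import gc
    import pytest
    import torch

    @pytest.fixture(autouse=True)
    def _release_cuda_memory():
        # After each example, drop leftover references and return cached
        # blocks to the driver so the next example starts from a clean pool.
        yield
        gc.collect()
        torch.cuda.empty_cache()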
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.1701687Z 2025-05-07T20:32:44.1701808Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.1702025Z 2025-05-07T20:32:44.1702220Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1702637Z self=, 2025-05-07T20:32:44.1703042Z T=1, 2025-05-07T20:32:44.1703230Z D=7168, 2025-05-07T20:32:44.1703427Z scale_ub=1200.0, 2025-05-07T20:32:44.1703654Z contiguous=True, 2025-05-07T20:32:44.1703873Z compiled=False, 2025-05-07T20:32:44.1704086Z ) 2025-05-07T20:32:44.1704404Z self = 2025-05-07T20:32:44.1704887Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.1705158Z 2025-05-07T20:32:44.1705238Z @given( 2025-05-07T20:32:44.1705469Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1705907Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1706221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1706550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1706961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1707242Z ) 2025-05-07T20:32:44.1707594Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1708036Z def test_silu_mul_quant( 2025-05-07T20:32:44.1708273Z self, 2025-05-07T20:32:44.1708471Z T: int, 2025-05-07T20:32:44.1708676Z D: int, 2025-05-07T20:32:44.1708890Z scale_ub: Optional[float], 2025-05-07T20:32:44.1709289Z contiguous: bool, 2025-05-07T20:32:44.1709530Z compiled: bool, 2025-05-07T20:32:44.1709746Z ) -> None: 2025-05-07T20:32:44.1709963Z torch.manual_seed(2025) 2025-05-07T20:32:44.1710203Z 2025-05-07T20:32:44.1710471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1710818Z 2025-05-07T20:32:44.1711017Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1711308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1711616Z x = x_sign * x_clamp 2025-05-07T20:32:44.1711866Z x0 = x[:, :D] 2025-05-07T20:32:44.1712084Z x1 = x[:, D:] 2025-05-07T20:32:44.1712288Z 2025-05-07T20:32:44.1712478Z if contiguous: 2025-05-07T20:32:44.1712712Z x0 = x0.contiguous() 2025-05-07T20:32:44.1712968Z x1 = x1.contiguous() 2025-05-07T20:32:44.1713213Z 2025-05-07T20:32:44.1713418Z if scale_ub is not None: 2025-05-07T20:32:44.1713697Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1714048Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1714359Z ) 2025-05-07T20:32:44.1714553Z else: 2025-05-07T20:32:44.1714770Z scale_ub_tensor = None 2025-05-07T20:32:44.1715026Z 2025-05-07T20:32:44.1715256Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1715579Z op = silu_mul_quant 2025-05-07T20:32:44.1715835Z if compiled: 2025-05-07T20:32:44.1716086Z op = torch.compile(op) 2025-05-07T20:32:44.1716394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1716674Z 2025-05-07T20:32:44.1716876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.1717037Z 2025-05-07T20:32:44.1717138Z moe/activation_test.py:117: 2025-05-07T20:32:44.1717435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1717769Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.1718047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1718738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.1719427Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.1719968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1720704Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1721405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1721941Z kernel = self.compile( 2025-05-07T20:32:44.1722477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1723135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1723542Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1723772Z 2025-05-07T20:32:44.1723984Z self = 2025-05-07T20:32:44.1725058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1726506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556c540>} 2025-05-07T20:32:44.1727844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1729139Z context = 2025-05-07T20:32:44.1729424Z 2025-05-07T20:32:44.1729596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1730104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1730566Z module_map=module_map) 2025-05-07T20:32:44.1730937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1731283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.1731547Z E ^ 2025-05-07T20:32:44.1732025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1732473Z 2025-05-07T20:32:44.1732894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1733403Z 2025-05-07T20:32:44.1733516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1733933Z self=, 2025-05-07T20:32:44.1734337Z T=128, 2025-05-07T20:32:44.1734522Z D=5120, 2025-05-07T20:32:44.1734743Z scale_ub=None, 2025-05-07T20:32:44.1734977Z contiguous=True, 2025-05-07T20:32:44.1735205Z compiled=False, 2025-05-07T20:32:44.1735427Z ) 2025-05-07T20:32:44.2284922Z self = 2025-05-07T20:32:44.2285449Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2285814Z 2025-05-07T20:32:44.2285950Z @given( 2025-05-07T20:32:44.2286291Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2286718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2298122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2298519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2298857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2299139Z ) 2025-05-07T20:32:44.2299497Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2299951Z def test_silu_mul_quant( 2025-05-07T20:32:44.2300194Z self, 2025-05-07T20:32:44.2300397Z T: int, 2025-05-07T20:32:44.2300600Z D: int, 2025-05-07T20:32:44.2300825Z scale_ub: Optional[float], 2025-05-07T20:32:44.2301392Z contiguous: bool, 2025-05-07T20:32:44.2301642Z compiled: bool, 2025-05-07T20:32:44.2301872Z ) -> None: 2025-05-07T20:32:44.2302097Z torch.manual_seed(2025) 2025-05-07T20:32:44.2302346Z 2025-05-07T20:32:44.2302717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2303060Z 2025-05-07T20:32:44.2303259Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2303585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2303905Z x = x_sign * x_clamp 2025-05-07T20:32:44.2304151Z x0 = x[:, :D] 2025-05-07T20:32:44.2304366Z x1 = x[:, D:] 2025-05-07T20:32:44.2304581Z 2025-05-07T20:32:44.2304778Z if contiguous: 2025-05-07T20:32:44.2305014Z x0 = x0.contiguous() 2025-05-07T20:32:44.2305288Z x1 = x1.contiguous() 2025-05-07T20:32:44.2305533Z 2025-05-07T20:32:44.2305724Z if scale_ub is not None: 2025-05-07T20:32:44.2306105Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2306478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2306809Z ) 2025-05-07T20:32:44.2307099Z else: 2025-05-07T20:32:44.2307321Z scale_ub_tensor = None 2025-05-07T20:32:44.2307576Z 2025-05-07T20:32:44.2307818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2308138Z op = silu_mul_quant 2025-05-07T20:32:44.2308399Z if compiled: 2025-05-07T20:32:44.2308643Z op = torch.compile(op) 2025-05-07T20:32:44.2308942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2309288Z 2025-05-07T20:32:44.2309482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2309653Z 2025-05-07T20:32:44.2309753Z moe/activation_test.py:117: 2025-05-07T20:32:44.2310051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2310378Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2310669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2311365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2312069Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2312601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2313291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2313956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2314484Z kernel = self.compile( 2025-05-07T20:32:44.2315033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2315704Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2316112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2316348Z 2025-05-07T20:32:44.2316592Z self = 2025-05-07T20:32:44.2317703Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2319109Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556d620>} 2025-05-07T20:32:44.2320463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2321500Z context = 2025-05-07T20:32:44.2321873Z 2025-05-07T20:32:44.2322042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2322570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2323090Z module_map=module_map) 2025-05-07T20:32:44.2323456Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2323814Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2324079Z E ^ 2025-05-07T20:32:44.2324541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2324998Z 2025-05-07T20:32:44.2325422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2325941Z 2025-05-07T20:32:44.2326048Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2326467Z self=, 2025-05-07T20:32:44.2326915Z T=128, 2025-05-07T20:32:44.2327112Z D=7168, 2025-05-07T20:32:44.2327315Z scale_ub=None, 2025-05-07T20:32:44.2327574Z contiguous=True, 2025-05-07T20:32:44.2327812Z compiled=False, 2025-05-07T20:32:44.2328026Z ) 2025-05-07T20:32:44.2328635Z self = 2025-05-07T20:32:44.2329129Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.2329408Z 2025-05-07T20:32:44.2329491Z @given( 2025-05-07T20:32:44.2329731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.2330042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.2330355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.2330695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.2331021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.2331313Z ) 2025-05-07T20:32:44.2331665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.2332118Z def test_silu_mul_quant( 2025-05-07T20:32:44.2332361Z self, 2025-05-07T20:32:44.2332567Z T: int, 2025-05-07T20:32:44.2332780Z D: int, 2025-05-07T20:32:44.2332998Z scale_ub: Optional[float], 2025-05-07T20:32:44.2333272Z contiguous: bool, 2025-05-07T20:32:44.2333516Z compiled: bool, 2025-05-07T20:32:44.2333740Z ) -> None: 2025-05-07T20:32:44.2333962Z torch.manual_seed(2025) 2025-05-07T20:32:44.2334210Z 2025-05-07T20:32:44.2334482Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.2334836Z 2025-05-07T20:32:44.2335040Z x_sign = torch.sign(x) 2025-05-07T20:32:44.2335333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.2335653Z x = x_sign * x_clamp 2025-05-07T20:32:44.2335904Z x0 = x[:, :D] 2025-05-07T20:32:44.2336128Z x1 = x[:, D:] 2025-05-07T20:32:44.2336349Z 2025-05-07T20:32:44.2336553Z if contiguous: 2025-05-07T20:32:44.2336805Z x0 = x0.contiguous() 2025-05-07T20:32:44.2337114Z x1 = x1.contiguous() 2025-05-07T20:32:44.2337354Z 2025-05-07T20:32:44.2337559Z if scale_ub is not None: 2025-05-07T20:32:44.2337843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.2338181Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.2338501Z ) 2025-05-07T20:32:44.2338699Z else: 2025-05-07T20:32:44.2338915Z scale_ub_tensor = None 2025-05-07T20:32:44.2339182Z 2025-05-07T20:32:44.2339419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.2339744Z op = silu_mul_quant 2025-05-07T20:32:44.2340000Z if compiled: 2025-05-07T20:32:44.2340259Z op = torch.compile(op) 2025-05-07T20:32:44.2340562Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2340920Z 2025-05-07T20:32:44.2341116Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.2341283Z 2025-05-07T20:32:44.2341389Z moe/activation_test.py:117: 2025-05-07T20:32:44.2341778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2342122Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.2342405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.2343090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.2343783Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.2344321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.2345006Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.2345664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.2346272Z kernel = self.compile( 2025-05-07T20:32:44.2346817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.2347540Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.2347941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.2348175Z 2025-05-07T20:32:44.2348382Z self = 2025-05-07T20:32:44.2349540Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.2350919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556e480>} 2025-05-07T20:32:44.2352275Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.2353309Z context = 2025-05-07T20:32:44.2353602Z 2025-05-07T20:32:44.2353768Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.2354291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.2354756Z module_map=module_map) 2025-05-07T20:32:44.2355121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.2355479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.2355735Z E ^ 2025-05-07T20:32:44.2356202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.2356670Z 2025-05-07T20:32:44.2357092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.2357607Z 2025-05-07T20:32:44.2357725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.2358132Z self=, 2025-05-07T20:32:44.2358541Z T=2048, 2025-05-07T20:32:44.2358736Z D=7168, 2025-05-07T20:32:44.2358931Z scale_ub=1200.0, 2025-05-07T20:32:44.2359163Z contiguous=True, 2025-05-07T20:32:44.2359389Z compiled=False, 2025-05-07T20:32:44.2359600Z ) 2025-05-07T20:32:44.3021275Z self = 2025-05-07T20:32:44.3022043Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3022434Z 2025-05-07T20:32:44.3022546Z @given( 2025-05-07T20:32:44.3022868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3023278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3023773Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3024106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3024544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3024833Z ) 2025-05-07T20:32:44.3025184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3025629Z def test_silu_mul_quant( 2025-05-07T20:32:44.3025870Z self, 2025-05-07T20:32:44.3026073Z T: int, 2025-05-07T20:32:44.3026274Z D: int, 2025-05-07T20:32:44.3026517Z scale_ub: Optional[float], 2025-05-07T20:32:44.3026824Z contiguous: bool, 2025-05-07T20:32:44.3027067Z compiled: bool, 2025-05-07T20:32:44.3027299Z ) -> None: 2025-05-07T20:32:44.3027519Z torch.manual_seed(2025) 2025-05-07T20:32:44.3027768Z 2025-05-07T20:32:44.3028050Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3030578Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
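Because the same CompilationError fires in the compiled=False example just above, the failure lives in the Triton kernel launched from activation.py:80, not in anything torch.compile adds. A minimal standalone repro, assuming silu_mul_quant is importable from the module path shown in the traceback; on a GPU below SM 8.9 this raises the same fp8e4nv error:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Small inputs suffice: the error is raised at kernel compile time,
    # before any data-dependent work happens.
    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # CompilationError on SM 8.6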
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3032459Z 2025-05-07T20:32:44.3032581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3032802Z 2025-05-07T20:32:44.3032912Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3033327Z self=, 2025-05-07T20:32:44.3033730Z T=1, 2025-05-07T20:32:44.3033916Z D=5120, 2025-05-07T20:32:44.3034115Z scale_ub=1200.0, 2025-05-07T20:32:44.3034343Z contiguous=True, 2025-05-07T20:32:44.3034560Z compiled=False, 2025-05-07T20:32:44.3034770Z ) 2025-05-07T20:32:44.3035100Z self = 2025-05-07T20:32:44.3035577Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3035847Z 2025-05-07T20:32:44.3035928Z @given( 2025-05-07T20:32:44.3036160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3036473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3036785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3037118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3037447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3037730Z ) 2025-05-07T20:32:44.3038089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3038532Z def test_silu_mul_quant( 2025-05-07T20:32:44.3038776Z self, 2025-05-07T20:32:44.3038977Z T: int, 2025-05-07T20:32:44.3039176Z D: int, 2025-05-07T20:32:44.3039396Z scale_ub: Optional[float], 2025-05-07T20:32:44.3039671Z contiguous: bool, 2025-05-07T20:32:44.3039926Z compiled: bool, 2025-05-07T20:32:44.3040145Z ) -> None: 2025-05-07T20:32:44.3040369Z torch.manual_seed(2025) 2025-05-07T20:32:44.3040615Z 2025-05-07T20:32:44.3040885Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3041230Z 2025-05-07T20:32:44.3041433Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3041728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3042045Z x = x_sign * x_clamp 2025-05-07T20:32:44.3042293Z x0 = x[:, :D] 2025-05-07T20:32:44.3042522Z x1 = x[:, D:] 2025-05-07T20:32:44.3042730Z 2025-05-07T20:32:44.3042927Z if contiguous: 2025-05-07T20:32:44.3043166Z x0 = x0.contiguous() 2025-05-07T20:32:44.3043494Z x1 = x1.contiguous() 2025-05-07T20:32:44.3043737Z 2025-05-07T20:32:44.3043936Z if scale_ub is not None: 2025-05-07T20:32:44.3044207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3044604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3044921Z ) 2025-05-07T20:32:44.3045117Z else: 2025-05-07T20:32:44.3045340Z scale_ub_tensor = None 2025-05-07T20:32:44.3045594Z 2025-05-07T20:32:44.3045826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3046144Z op = silu_mul_quant 2025-05-07T20:32:44.3046402Z if compiled: 2025-05-07T20:32:44.3046678Z op = torch.compile(op) 2025-05-07T20:32:44.3047020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3047296Z 2025-05-07T20:32:44.3047500Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3047732Z 2025-05-07T20:32:44.3047839Z moe/activation_test.py:117: 2025-05-07T20:32:44.3048133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3048467Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3048791Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3049475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3050164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3050697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3051376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3052032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3052576Z kernel = self.compile( 2025-05-07T20:32:44.3053129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3053789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3054195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3054430Z 2025-05-07T20:32:44.3054636Z self = 2025-05-07T20:32:44.3055713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3057123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556fa60>} 2025-05-07T20:32:44.3058464Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3059496Z context = 2025-05-07T20:32:44.3059783Z 2025-05-07T20:32:44.3059958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3060472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3060932Z module_map=module_map) 2025-05-07T20:32:44.3061303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3061657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3061921Z E ^ 2025-05-07T20:32:44.3062389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3062843Z 2025-05-07T20:32:44.3063270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3063840Z 2025-05-07T20:32:44.3063957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3064373Z self=, 2025-05-07T20:32:44.3064822Z T=2048, 2025-05-07T20:32:44.3065023Z D=5120, 2025-05-07T20:32:44.3065215Z scale_ub=None, 2025-05-07T20:32:44.3065444Z contiguous=True, 2025-05-07T20:32:44.3065678Z compiled=False, 2025-05-07T20:32:44.3065885Z ) 2025-05-07T20:32:44.3066207Z self = 2025-05-07T20:32:44.3066703Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3066973Z 2025-05-07T20:32:44.3067064Z @given( 2025-05-07T20:32:44.3067299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3067616Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3067927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3068305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3068635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3068922Z ) 2025-05-07T20:32:44.3069373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3069814Z def test_silu_mul_quant( 2025-05-07T20:32:44.3070063Z self, 2025-05-07T20:32:44.3070256Z T: int, 2025-05-07T20:32:44.3070461Z D: int, 2025-05-07T20:32:44.3070682Z scale_ub: Optional[float], 2025-05-07T20:32:44.3070949Z contiguous: bool, 2025-05-07T20:32:44.3071193Z compiled: bool, 2025-05-07T20:32:44.3071418Z ) -> None: 2025-05-07T20:32:44.3071641Z torch.manual_seed(2025) 2025-05-07T20:32:44.3071881Z 2025-05-07T20:32:44.3072160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3072505Z 2025-05-07T20:32:44.3072698Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.3074685Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3076565Z 2025-05-07T20:32:44.3076689Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.3076902Z 2025-05-07T20:32:44.3077014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3077430Z self=, 2025-05-07T20:32:44.3077836Z T=16384, 2025-05-07T20:32:44.3078034Z D=5120, 2025-05-07T20:32:44.3078233Z scale_ub=None, 2025-05-07T20:32:44.3078447Z contiguous=True, 2025-05-07T20:32:44.3078680Z compiled=False, 2025-05-07T20:32:44.3078888Z ) 2025-05-07T20:32:44.3784125Z self = 2025-05-07T20:32:44.3785625Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3786375Z 2025-05-07T20:32:44.3786594Z @given( 2025-05-07T20:32:44.3787127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3787491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3787812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3788145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3788484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3788765Z ) 2025-05-07T20:32:44.3789201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3789651Z def test_silu_mul_quant( 2025-05-07T20:32:44.3790099Z self, 2025-05-07T20:32:44.3790299Z T: int, 2025-05-07T20:32:44.3790505Z D: int, 2025-05-07T20:32:44.3790731Z scale_ub: Optional[float], 2025-05-07T20:32:44.3791003Z contiguous: bool, 2025-05-07T20:32:44.3791322Z compiled: bool, 2025-05-07T20:32:44.3791556Z ) -> None: 2025-05-07T20:32:44.3791780Z torch.manual_seed(2025) 2025-05-07T20:32:44.3792025Z 2025-05-07T20:32:44.3792301Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3794323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3796242Z 2025-05-07T20:32:44.3796428Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3796645Z 2025-05-07T20:32:44.3796752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3797163Z self=, 2025-05-07T20:32:44.3797571Z T=4096, 2025-05-07T20:32:44.3797764Z D=5120, 2025-05-07T20:32:44.3797966Z scale_ub=None, 2025-05-07T20:32:44.3798190Z contiguous=True, 2025-05-07T20:32:44.3798412Z compiled=False, 2025-05-07T20:32:44.3798620Z ) 2025-05-07T20:32:44.3798943Z self = 2025-05-07T20:32:44.3799425Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.3799696Z 2025-05-07T20:32:44.3799775Z @given( 2025-05-07T20:32:44.3800012Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3800337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3800638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3800970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3801302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3801581Z ) 2025-05-07T20:32:44.3801931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3802373Z def test_silu_mul_quant( 2025-05-07T20:32:44.3802614Z self, 2025-05-07T20:32:44.3802820Z T: int, 2025-05-07T20:32:44.3803022Z D: int, 2025-05-07T20:32:44.3803243Z scale_ub: Optional[float], 2025-05-07T20:32:44.3803521Z contiguous: bool, 2025-05-07T20:32:44.3803766Z compiled: bool, 2025-05-07T20:32:44.3803988Z ) -> None: 2025-05-07T20:32:44.3804210Z torch.manual_seed(2025) 2025-05-07T20:32:44.3804472Z 2025-05-07T20:32:44.3804751Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3806789Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3808667Z 2025-05-07T20:32:44.3808792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3809016Z 2025-05-07T20:32:44.3809124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3809550Z self=, 2025-05-07T20:32:44.3809957Z T=2048, 2025-05-07T20:32:44.3810222Z D=5120, 2025-05-07T20:32:44.3810431Z scale_ub=None, 2025-05-07T20:32:44.3810652Z contiguous=False, 2025-05-07T20:32:44.3810890Z compiled=False, 2025-05-07T20:32:44.3811111Z ) 2025-05-07T20:32:44.3811483Z self = 2025-05-07T20:32:44.3811973Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3819486Z 2025-05-07T20:32:44.3819593Z @given( 2025-05-07T20:32:44.3819850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3820180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3820494Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3820836Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3821172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3821457Z ) 2025-05-07T20:32:44.3821819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3822358Z def test_silu_mul_quant( 2025-05-07T20:32:44.3822602Z self, 2025-05-07T20:32:44.3822809Z T: int, 2025-05-07T20:32:44.3823020Z D: int, 2025-05-07T20:32:44.3823286Z scale_ub: Optional[float], 2025-05-07T20:32:44.3823570Z contiguous: bool, 2025-05-07T20:32:44.3823821Z compiled: bool, 2025-05-07T20:32:44.3824048Z ) -> None: 2025-05-07T20:32:44.3824278Z torch.manual_seed(2025) 2025-05-07T20:32:44.3824527Z 2025-05-07T20:32:44.3824809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3826875Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3829070Z 2025-05-07T20:32:44.3829199Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3829428Z 2025-05-07T20:32:44.3829534Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3829960Z self=, 2025-05-07T20:32:44.3830363Z T=4096, 2025-05-07T20:32:44.3830567Z D=7168, 2025-05-07T20:32:44.3830771Z scale_ub=None, 2025-05-07T20:32:44.3830993Z contiguous=True, 2025-05-07T20:32:44.3831224Z compiled=True, 2025-05-07T20:32:44.3831432Z ) 2025-05-07T20:32:44.3831758Z self = 2025-05-07T20:32:44.3832244Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.3832522Z 2025-05-07T20:32:44.3832611Z @given( 2025-05-07T20:32:44.3832858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3833168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3833492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3833833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3834165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3834461Z ) 2025-05-07T20:32:44.3834820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3835277Z def test_silu_mul_quant( 2025-05-07T20:32:44.3835526Z self, 2025-05-07T20:32:44.3835736Z T: int, 2025-05-07T20:32:44.3835946Z D: int, 2025-05-07T20:32:44.3836169Z scale_ub: Optional[float], 2025-05-07T20:32:44.3836457Z contiguous: bool, 2025-05-07T20:32:44.3836707Z compiled: bool, 2025-05-07T20:32:44.3836953Z ) -> None: 2025-05-07T20:32:44.3837206Z torch.manual_seed(2025) 2025-05-07T20:32:44.3837559Z 2025-05-07T20:32:44.3837835Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3839984Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3841870Z 2025-05-07T20:32:44.3841997Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3842220Z 2025-05-07T20:32:44.3842329Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3842748Z self=, 2025-05-07T20:32:44.3843218Z T=2048, 2025-05-07T20:32:44.3843423Z D=5120, 2025-05-07T20:32:44.3843639Z scale_ub=1200.0, 2025-05-07T20:32:44.3843867Z contiguous=False, 2025-05-07T20:32:44.3844167Z compiled=False, 2025-05-07T20:32:44.3844387Z ) 2025-05-07T20:32:44.3844706Z self = 2025-05-07T20:32:44.3845214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.3845496Z 2025-05-07T20:32:44.3845591Z @given( 2025-05-07T20:32:44.3845832Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3846144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3846462Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3846803Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3847137Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3847481Z ) 2025-05-07T20:32:44.3847847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3848296Z def test_silu_mul_quant( 2025-05-07T20:32:44.3848549Z self, 2025-05-07T20:32:44.3848758Z T: int, 2025-05-07T20:32:44.3848964Z D: int, 2025-05-07T20:32:44.3849196Z scale_ub: Optional[float], 2025-05-07T20:32:44.3849474Z contiguous: bool, 2025-05-07T20:32:44.3849711Z compiled: bool, 2025-05-07T20:32:44.3849946Z ) -> None: 2025-05-07T20:32:44.3850174Z torch.manual_seed(2025) 2025-05-07T20:32:44.3850416Z 2025-05-07T20:32:44.3850698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3852767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.3854653Z 2025-05-07T20:32:44.3854775Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.3854991Z 2025-05-07T20:32:44.3855104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3855520Z self=, 2025-05-07T20:32:44.3855946Z T=4096, 2025-05-07T20:32:44.3856147Z D=7168, 2025-05-07T20:32:44.3856345Z scale_ub=1200.0, 2025-05-07T20:32:44.3856578Z contiguous=True, 2025-05-07T20:32:44.3856813Z compiled=False, 2025-05-07T20:32:44.3857032Z ) 2025-05-07T20:32:44.4769657Z self = 2025-05-07T20:32:44.4771182Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4772219Z 2025-05-07T20:32:44.4772384Z @given( 2025-05-07T20:32:44.4772848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4773478Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4774223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4774881Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4775529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4776088Z ) 2025-05-07T20:32:44.4776784Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4777319Z def test_silu_mul_quant( 2025-05-07T20:32:44.4777560Z self, 2025-05-07T20:32:44.4777769Z T: int, 2025-05-07T20:32:44.4777976Z D: int, 2025-05-07T20:32:44.4778205Z scale_ub: Optional[float], 2025-05-07T20:32:44.4778472Z contiguous: bool, 2025-05-07T20:32:44.4778718Z compiled: bool, 2025-05-07T20:32:44.4779117Z ) -> None: 2025-05-07T20:32:44.4779333Z torch.manual_seed(2025) 2025-05-07T20:32:44.4779576Z 2025-05-07T20:32:44.4779976Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4782016Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4783876Z 2025-05-07T20:32:44.4784002Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4784212Z 2025-05-07T20:32:44.4784315Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4784731Z self=, 2025-05-07T20:32:44.4785136Z T=16384, 2025-05-07T20:32:44.4785328Z D=7168, 2025-05-07T20:32:44.4785531Z scale_ub=None, 2025-05-07T20:32:44.4785757Z contiguous=False, 2025-05-07T20:32:44.4785980Z compiled=True, 2025-05-07T20:32:44.4786190Z ) 2025-05-07T20:32:44.4786512Z self = 2025-05-07T20:32:44.4787004Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.4787277Z 2025-05-07T20:32:44.4787356Z @given( 2025-05-07T20:32:44.4787586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4787897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4788197Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4788526Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4788856Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4789250Z ) 2025-05-07T20:32:44.4789596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4790044Z def test_silu_mul_quant( 2025-05-07T20:32:44.4790289Z self, 2025-05-07T20:32:44.4790481Z T: int, 2025-05-07T20:32:44.4790678Z D: int, 2025-05-07T20:32:44.4790895Z scale_ub: Optional[float], 2025-05-07T20:32:44.4791163Z contiguous: bool, 2025-05-07T20:32:44.4791409Z compiled: bool, 2025-05-07T20:32:44.4791641Z ) -> None: 2025-05-07T20:32:44.4791850Z torch.manual_seed(2025) 2025-05-07T20:32:44.4792094Z 2025-05-07T20:32:44.4792363Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4794435Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
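The "Tried to allocate" sizes follow directly from the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. A quick check against the figures above:

    # 80.00 MiB  <- T=4096,  D=5120:  4096 * 10240 * 2 / 2**20 = 80.0
    # 112.00 MiB <- T=4096,  D=7168:  4096 * 14336 * 2 / 2**20 = 112.0
    # 448.00 MiB <- T=16384, D=7168: 16384 * 14336 * 2 / 2**20 = 448.0
    T, D, bytes_per_bf16 = 16384, 7168, 2
    print(T * (2 * D) * bytes_per_bf16 / 2**20)  # 448.0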
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4796328Z 2025-05-07T20:32:44.4796446Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4796688Z 2025-05-07T20:32:44.4796801Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4797230Z self=, 2025-05-07T20:32:44.4797636Z T=4096, 2025-05-07T20:32:44.4797829Z D=7168, 2025-05-07T20:32:44.4798024Z scale_ub=None, 2025-05-07T20:32:44.4798244Z contiguous=True, 2025-05-07T20:32:44.4798465Z compiled=False, 2025-05-07T20:32:44.4798673Z ) 2025-05-07T20:32:44.4798991Z self = 2025-05-07T20:32:44.4799524Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4799798Z 2025-05-07T20:32:44.4799881Z @given( 2025-05-07T20:32:44.4800160Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4800475Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4800786Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4801125Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4801460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4801740Z ) 2025-05-07T20:32:44.4802099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4802544Z def test_silu_mul_quant( 2025-05-07T20:32:44.4802791Z self, 2025-05-07T20:32:44.4802996Z T: int, 2025-05-07T20:32:44.4803205Z D: int, 2025-05-07T20:32:44.4803426Z scale_ub: Optional[float], 2025-05-07T20:32:44.4803711Z contiguous: bool, 2025-05-07T20:32:44.4803958Z compiled: bool, 2025-05-07T20:32:44.4804177Z ) -> None: 2025-05-07T20:32:44.4804401Z torch.manual_seed(2025) 2025-05-07T20:32:44.4804661Z 2025-05-07T20:32:44.4804943Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4807010Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4808866Z 2025-05-07T20:32:44.4809000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4809218Z 2025-05-07T20:32:44.4809331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4809752Z self=, 2025-05-07T20:32:44.4810168Z T=16384, 2025-05-07T20:32:44.4810369Z D=7168, 2025-05-07T20:32:44.4810574Z scale_ub=None, 2025-05-07T20:32:44.4810796Z contiguous=True, 2025-05-07T20:32:44.4811021Z compiled=False, 2025-05-07T20:32:44.4811240Z ) 2025-05-07T20:32:44.4811569Z self = 2025-05-07T20:32:44.4812065Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.4812358Z 2025-05-07T20:32:44.4812444Z @given( 2025-05-07T20:32:44.4812683Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4813001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4813308Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4813644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4814027Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4814309Z ) 2025-05-07T20:32:44.4814664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4815157Z def test_silu_mul_quant( 2025-05-07T20:32:44.4815398Z self, 2025-05-07T20:32:44.4815602Z T: int, 2025-05-07T20:32:44.4815803Z D: int, 2025-05-07T20:32:44.4816020Z scale_ub: Optional[float], 2025-05-07T20:32:44.4816292Z contiguous: bool, 2025-05-07T20:32:44.4816541Z compiled: bool, 2025-05-07T20:32:44.4816761Z ) -> None: 2025-05-07T20:32:44.4816974Z torch.manual_seed(2025) 2025-05-07T20:32:44.4817220Z 2025-05-07T20:32:44.4817499Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4819559Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4821460Z 2025-05-07T20:32:44.4821581Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4821801Z 2025-05-07T20:32:44.4821904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4822311Z self=, 2025-05-07T20:32:44.4822714Z T=16384, 2025-05-07T20:32:44.4822907Z D=7168, 2025-05-07T20:32:44.4823108Z scale_ub=1200.0, 2025-05-07T20:32:44.4823335Z contiguous=True, 2025-05-07T20:32:44.4823563Z compiled=False, 2025-05-07T20:32:44.4823776Z ) 2025-05-07T20:32:44.4824101Z self = 2025-05-07T20:32:44.4824592Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.4824880Z 2025-05-07T20:32:44.4824962Z @given( 2025-05-07T20:32:44.4825201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.4825511Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.4825821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.4826161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.4826505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.4826815Z ) 2025-05-07T20:32:44.4827194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.4827639Z def test_silu_mul_quant( 2025-05-07T20:32:44.4827887Z self, 2025-05-07T20:32:44.4828094Z T: int, 2025-05-07T20:32:44.4828583Z D: int, 2025-05-07T20:32:44.4828803Z scale_ub: Optional[float], 2025-05-07T20:32:44.4829118Z contiguous: bool, 2025-05-07T20:32:44.4829365Z compiled: bool, 2025-05-07T20:32:44.4829583Z ) -> None: 2025-05-07T20:32:44.4829812Z torch.manual_seed(2025) 2025-05-07T20:32:44.4830068Z 2025-05-07T20:32:44.4830334Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.4832363Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
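Each example allocates x afresh, so these OOMs reflect the ~22 GiB already held by the process before the first failing draw, not the 40-448 MiB requests themselves. A hedged mitigation sketch (an assumption, not part of the test suite): trim the allocator cache between examples, e.g. from setUp()/tearDown():

    import gc

    import torch

    def release_cached_cuda_memory() -> None:
        # Drop dangling references first, then return cached blocks to the
        # driver. Live tensors are untouched; only PyTorch's cache shrinks.
        gc.collect()
        torch.cuda.empty_cache()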
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.4834217Z 2025-05-07T20:32:44.4834427Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.4834681Z 2025-05-07T20:32:44.4834794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.4835340Z self=, 2025-05-07T20:32:44.4835742Z T=128, 2025-05-07T20:32:44.4835937Z D=5120, 2025-05-07T20:32:44.4836137Z scale_ub=1200.0, 2025-05-07T20:32:44.4836367Z contiguous=False, 2025-05-07T20:32:44.4836593Z compiled=False, 2025-05-07T20:32:44.4836799Z ) 2025-05-07T20:32:44.5849674Z self = 2025-05-07T20:32:44.5851121Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.5851786Z 2025-05-07T20:32:44.5851950Z @given( 2025-05-07T20:32:44.5852411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5853031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5854020Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5854681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5855326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5855875Z ) 2025-05-07T20:32:44.5856714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5857463Z def test_silu_mul_quant( 2025-05-07T20:32:44.5857735Z self, 2025-05-07T20:32:44.5857935Z T: int, 2025-05-07T20:32:44.5858143Z D: int, 2025-05-07T20:32:44.5858364Z scale_ub: Optional[float], 2025-05-07T20:32:44.5858632Z contiguous: bool, 2025-05-07T20:32:44.5858873Z compiled: bool, 2025-05-07T20:32:44.5859106Z ) -> None: 2025-05-07T20:32:44.5859320Z torch.manual_seed(2025) 2025-05-07T20:32:44.5859563Z 2025-05-07T20:32:44.5859840Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5860179Z 2025-05-07T20:32:44.5860384Z x_sign = torch.sign(x) 2025-05-07T20:32:44.5860683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.5860990Z x = x_sign * x_clamp 2025-05-07T20:32:44.5861237Z x0 = x[:, :D] 2025-05-07T20:32:44.5861460Z x1 = x[:, D:] 2025-05-07T20:32:44.5861667Z 2025-05-07T20:32:44.5861858Z if contiguous: 2025-05-07T20:32:44.5862095Z x0 = x0.contiguous() 2025-05-07T20:32:44.5862348Z x1 = x1.contiguous() 2025-05-07T20:32:44.5862589Z 2025-05-07T20:32:44.5862788Z if scale_ub is not None: 2025-05-07T20:32:44.5863061Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.5863398Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.5863715Z ) 2025-05-07T20:32:44.5863914Z else: 2025-05-07T20:32:44.5864124Z scale_ub_tensor = None 2025-05-07T20:32:44.5864378Z 2025-05-07T20:32:44.5864611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.5864923Z op = silu_mul_quant 2025-05-07T20:32:44.5865178Z if compiled: 2025-05-07T20:32:44.5865429Z op = torch.compile(op) 2025-05-07T20:32:44.5865722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.5866002Z 2025-05-07T20:32:44.5866200Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.5866366Z 2025-05-07T20:32:44.5866468Z moe/activation_test.py:117: 2025-05-07T20:32:44.5866763Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.5867101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.5867385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.5868077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.5868770Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.5869411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.5870178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.5870837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.5871443Z kernel = self.compile( 2025-05-07T20:32:44.5871987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.5872632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.5873033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.5873259Z 2025-05-07T20:32:44.5873473Z self = 2025-05-07T20:32:44.5874549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.5876045Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b6660>} 2025-05-07T20:32:44.5877441Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.5878474Z context = 2025-05-07T20:32:44.5878767Z 2025-05-07T20:32:44.5878941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.5879453Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.5879927Z module_map=module_map) 2025-05-07T20:32:44.5880300Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.5880659Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.5880924Z E ^ 2025-05-07T20:32:44.5881395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.5881847Z 2025-05-07T20:32:44.5882270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.5882779Z 2025-05-07T20:32:44.5882898Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5883308Z self=, 2025-05-07T20:32:44.5883713Z T=2048, 2025-05-07T20:32:44.5883907Z D=7168, 2025-05-07T20:32:44.5884099Z scale_ub=None, 2025-05-07T20:32:44.5884324Z contiguous=False, 2025-05-07T20:32:44.5884562Z compiled=False, 2025-05-07T20:32:44.5884770Z ) 2025-05-07T20:32:44.5885097Z self = 2025-05-07T20:32:44.5885596Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.5885869Z 2025-05-07T20:32:44.5885953Z @given( 2025-05-07T20:32:44.5886193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.5886521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.5886837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.5887191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.5887540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.5887831Z ) 2025-05-07T20:32:44.5888172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.5888611Z def test_silu_mul_quant( 2025-05-07T20:32:44.5888858Z self, 2025-05-07T20:32:44.5889048Z T: int, 2025-05-07T20:32:44.5889255Z D: int, 2025-05-07T20:32:44.5889480Z scale_ub: Optional[float], 2025-05-07T20:32:44.5889748Z contiguous: bool, 2025-05-07T20:32:44.5889992Z compiled: bool, 2025-05-07T20:32:44.5890272Z ) -> None: 2025-05-07T20:32:44.5890489Z torch.manual_seed(2025) 2025-05-07T20:32:44.5890734Z 2025-05-07T20:32:44.5891014Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.5893133Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
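The CompilationError interleaved with the OOMs above is a different failure: Triton declines to lower fp8e4nv (FP8 E4M3) on this GPU architecture, which only exposes 'fp8e4b15' and 'fp8e5'. E4M3 support arrives with compute capability 8.9 (Ada) / 9.0 (Hopper), so one option is to gate FP8 tests on device capability; a sketch (the threshold and skip message are assumptions derived from the error text, not the suite's own gating):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        # Triton's fp8e4nv needs SM 8.9+; older parts raise the
        # ValueError quoted in this log.
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...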
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.5894970Z 2025-05-07T20:32:44.5895096Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.5895308Z 2025-05-07T20:32:44.5895456Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.5895875Z self=, 2025-05-07T20:32:44.5896282Z T=128, 2025-05-07T20:32:44.5896508Z D=7168, 2025-05-07T20:32:44.5896714Z scale_ub=1200.0, 2025-05-07T20:32:44.5896950Z contiguous=True, 2025-05-07T20:32:44.5897168Z compiled=True, 2025-05-07T20:32:44.5897375Z ) 2025-05-07T20:32:44.6197041Z self = 2025-05-07T20:32:44.6197834Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6198214Z 2025-05-07T20:32:44.6198332Z @given( 2025-05-07T20:32:44.6198642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6199052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6199367Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6199707Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6200042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6200338Z ) 2025-05-07T20:32:44.6200697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6201139Z def test_silu_mul_quant( 2025-05-07T20:32:44.6201395Z self, 2025-05-07T20:32:44.6201597Z T: int, 2025-05-07T20:32:44.6201798Z D: int, 2025-05-07T20:32:44.6202025Z scale_ub: Optional[float], 2025-05-07T20:32:44.6202300Z contiguous: bool, 2025-05-07T20:32:44.6210287Z compiled: bool, 2025-05-07T20:32:44.6210551Z ) -> None: 2025-05-07T20:32:44.6210778Z torch.manual_seed(2025) 2025-05-07T20:32:44.6211035Z 2025-05-07T20:32:44.6211318Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6211663Z 2025-05-07T20:32:44.6211870Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6212175Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6212499Z x = x_sign * x_clamp 2025-05-07T20:32:44.6212756Z x0 = x[:, :D] 2025-05-07T20:32:44.6212987Z x1 = x[:, D:] 2025-05-07T20:32:44.6213198Z 2025-05-07T20:32:44.6213397Z if contiguous: 2025-05-07T20:32:44.6213646Z x0 = x0.contiguous() 2025-05-07T20:32:44.6213911Z x1 = x1.contiguous() 2025-05-07T20:32:44.6214160Z 2025-05-07T20:32:44.6214362Z if scale_ub is not None: 2025-05-07T20:32:44.6214637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.6214980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.6215297Z ) 2025-05-07T20:32:44.6215503Z else: 2025-05-07T20:32:44.6215717Z scale_ub_tensor = None 2025-05-07T20:32:44.6215974Z 2025-05-07T20:32:44.6216215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.6216529Z op = silu_mul_quant 2025-05-07T20:32:44.6216804Z if compiled: 2025-05-07T20:32:44.6217096Z op = torch.compile(op) 2025-05-07T20:32:44.6217638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6217919Z 2025-05-07T20:32:44.6218121Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.6218287Z 2025-05-07T20:32:44.6218485Z moe/activation_test.py:117: 2025-05-07T20:32:44.6218789Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6219136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.6219424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.6219987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.6220566Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.6221246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.6221938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.6222478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.6223379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.6224253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.6224879Z kernel = self.compile( 2025-05-07T20:32:44.6225518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.6226300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.6226754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.6227033Z 2025-05-07T20:32:44.6227267Z self = 2025-05-07T20:32:44.6229183Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.6230940Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b7c40>} 2025-05-07T20:32:44.6232608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.6233859Z context = 2025-05-07T20:32:44.6234207Z 2025-05-07T20:32:44.6234393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.6235007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.6235560Z module_map=module_map) 2025-05-07T20:32:44.6235972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.6236381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.6236696Z E ^ 2025-05-07T20:32:44.6237266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.6237825Z 2025-05-07T20:32:44.6238330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.6238962Z 2025-05-07T20:32:44.6239076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6239563Z self=, 2025-05-07T20:32:44.6240025Z T=128, 2025-05-07T20:32:44.6240232Z D=7168, 2025-05-07T20:32:44.6240445Z scale_ub=1200.0, 2025-05-07T20:32:44.6240685Z contiguous=True, 2025-05-07T20:32:44.6240930Z compiled=False, 2025-05-07T20:32:44.6241166Z ) 2025-05-07T20:32:44.6241522Z self = 2025-05-07T20:32:44.6242121Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.6242400Z 2025-05-07T20:32:44.6242482Z @given( 2025-05-07T20:32:44.6242789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6243104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6243418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6243758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6244086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6244379Z ) 2025-05-07T20:32:44.6244738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6245181Z def test_silu_mul_quant( 2025-05-07T20:32:44.6245427Z self, 2025-05-07T20:32:44.6245629Z T: int, 2025-05-07T20:32:44.6245826Z D: int, 2025-05-07T20:32:44.6246049Z scale_ub: Optional[float], 2025-05-07T20:32:44.6246442Z contiguous: bool, 2025-05-07T20:32:44.6246692Z compiled: bool, 2025-05-07T20:32:44.6246913Z ) -> None: 2025-05-07T20:32:44.6247170Z torch.manual_seed(2025) 2025-05-07T20:32:44.6247528Z 2025-05-07T20:32:44.6247804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6248156Z 2025-05-07T20:32:44.6248355Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6248639Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6250665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6252545Z 2025-05-07T20:32:44.6252667Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6252896Z 2025-05-07T20:32:44.6253004Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6253422Z self=, 2025-05-07T20:32:44.6253828Z T=128, 2025-05-07T20:32:44.6254023Z D=5120, 2025-05-07T20:32:44.6254223Z scale_ub=1200.0, 2025-05-07T20:32:44.6254446Z contiguous=True, 2025-05-07T20:32:44.6254676Z compiled=True, 2025-05-07T20:32:44.6254885Z ) 2025-05-07T20:32:44.6255208Z self = 2025-05-07T20:32:44.6255705Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.6255981Z 2025-05-07T20:32:44.6256064Z @given( 2025-05-07T20:32:44.6256298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.6256615Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.6256926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.6257264Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.6257592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.6257883Z ) 2025-05-07T20:32:44.6258236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.6258683Z def test_silu_mul_quant( 2025-05-07T20:32:44.6258927Z self, 2025-05-07T20:32:44.6259125Z T: int, 2025-05-07T20:32:44.6259333Z D: int, 2025-05-07T20:32:44.6259552Z scale_ub: Optional[float], 2025-05-07T20:32:44.6259825Z contiguous: bool, 2025-05-07T20:32:44.6260066Z compiled: bool, 2025-05-07T20:32:44.6260286Z ) -> None: 2025-05-07T20:32:44.6260503Z torch.manual_seed(2025) 2025-05-07T20:32:44.6260799Z 2025-05-07T20:32:44.6261167Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.6261748Z 2025-05-07T20:32:44.6262015Z x_sign = torch.sign(x) 2025-05-07T20:32:44.6262397Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.6264504Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.6266381Z 2025-05-07T20:32:44.6266501Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.6266745Z 2025-05-07T20:32:44.6266863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.6267335Z self=, 2025-05-07T20:32:44.6267735Z T=128, 2025-05-07T20:32:44.6267932Z D=7168, 2025-05-07T20:32:44.6268171Z scale_ub=None, 2025-05-07T20:32:44.6268386Z contiguous=True, 2025-05-07T20:32:44.6268612Z compiled=True, 2025-05-07T20:32:44.6268816Z ) 2025-05-07T20:32:44.8199264Z self = 2025-05-07T20:32:44.8199984Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8200342Z 2025-05-07T20:32:44.8200427Z @given( 2025-05-07T20:32:44.8200666Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8200979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8201289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8201620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8201950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8202258Z ) 2025-05-07T20:32:44.8202610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8203053Z def test_silu_mul_quant( 2025-05-07T20:32:44.8203304Z self, 2025-05-07T20:32:44.8203511Z T: int, 2025-05-07T20:32:44.8203713Z D: int, 2025-05-07T20:32:44.8203929Z scale_ub: Optional[float], 2025-05-07T20:32:44.8204204Z contiguous: bool, 2025-05-07T20:32:44.8204450Z compiled: bool, 2025-05-07T20:32:44.8204679Z ) -> None: 2025-05-07T20:32:44.8204903Z torch.manual_seed(2025) 2025-05-07T20:32:44.8205148Z 2025-05-07T20:32:44.8205419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8207468Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
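The figures in these messages (total capacity, free, allocated by PyTorch, reserved but unallocated) can be read back programmatically when triaging; a small sketch:

    import torch

    MIB = 2**20
    free_b, total_b = torch.cuda.mem_get_info()  # driver-level free/total
    allocated_b = torch.cuda.memory_allocated()  # live PyTorch tensors
    reserved_b = torch.cuda.memory_reserved()    # held by the caching allocator
    print(f"free {free_b / MIB:.2f} MiB of {total_b / MIB:.2f} MiB total")
    # reserved - allocated ~ "reserved by PyTorch but unallocated" above
    print(f"allocated {allocated_b / MIB:.2f} MiB, reserved {reserved_b / MIB:.2f} MiB")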
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8209336Z 2025-05-07T20:32:44.8209456Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.8209674Z 2025-05-07T20:32:44.8219002Z FAILED 2025-05-07T20:32:44.8219173Z 2025-05-07T20:32:44.8219629Z =================================== FAILURES =================================== 2025-05-07T20:32:44.8220233Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:44.8220854Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:44.8221708Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 57, in testPartExecutor 2025-05-07T20:32:44.8222478Z | yield 2025-05-07T20:32:44.8223261Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 623, in run 2025-05-07T20:32:44.8223982Z | self._callTestMethod(testMethod) 2025-05-07T20:32:44.8224863Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/unittest/case.py", line 579, in _callTestMethod 2025-05-07T20:32:44.8225617Z | if method() is not None: 2025-05-07T20:32:44.8225951Z | ^^^^^^^^ 2025-05-07T20:32:44.8227010Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:44.8228025Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8228674Z | ^^^^^^^ 2025-05-07T20:32:44.8229538Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:44.8230423Z | raise the_error_hypothesis_found 2025-05-07T20:32:44.8231137Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:44.8231718Z +-+---------------- 1 ---------------- 2025-05-07T20:32:44.8232220Z | Traceback (most recent call last): 2025-05-07T20:32:44.8233220Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8234291Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8234814Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8237647Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8240437Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8241051Z | self=, 2025-05-07T20:32:44.8241609Z | T=2048, 2025-05-07T20:32:44.8241935Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8242404Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8242892Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8243402Z | compiled=False, # or any other generated value 2025-05-07T20:32:44.8243832Z | ) 2025-05-07T20:32:44.8244092Z | 2025-05-07T20:32:44.8244828Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:44.8245688Z +---------------- 2 ---------------- 2025-05-07T20:32:44.8246090Z | Traceback (most recent call last): 2025-05-07T20:32:44.8247114Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8248215Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8248733Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8251506Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8254379Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8255062Z | self=, 2025-05-07T20:32:44.8255618Z | T=128, 2025-05-07T20:32:44.8255891Z | D=7168, 2025-05-07T20:32:44.8256171Z | scale_ub=None, 2025-05-07T20:32:44.8256503Z | contiguous=True, 2025-05-07T20:32:44.8256832Z | compiled=True, 2025-05-07T20:32:44.8257137Z | ) 2025-05-07T20:32:44.8257380Z | 2025-05-07T20:32:44.8258107Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8258818Z +---------------- 3 ---------------- 2025-05-07T20:32:44.8259134Z | Traceback (most recent call last): 2025-05-07T20:32:44.8259986Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:44.8260860Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8261253Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8263213Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
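The replay recipe Hypothesis prints is meant to be applied verbatim. A placement sketch using the blob from sub-exception 2 above (version string and payload copied from this log; the trivial body is a stand-in for the real test, and on a build where the failure no longer occurs Hypothesis raises DidNotReproduce):

    from typing import Optional

    from hypothesis import given, reproduce_failure, settings, strategies as st

    # Pins Hypothesis to the exact failing draw:
    # T=128, D=7168, scale_ub=None, contiguous=True, compiled=True.
    # Remove the decorator once the underlying bug is fixed.
    @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_replay(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # the real test body goes here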
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.8265167Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8265611Z | self=, 2025-05-07T20:32:44.8266024Z | T=128, 2025-05-07T20:32:44.8266223Z | D=5120, 2025-05-07T20:32:44.8266442Z | scale_ub=1200.0, 2025-05-07T20:32:44.8266688Z | contiguous=True, 2025-05-07T20:32:44.8266951Z | compiled=True, 2025-05-07T20:32:44.8267296Z | ) 2025-05-07T20:32:44.8267544Z | 2025-05-07T20:32:44.8268262Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8269205Z +---------------- 4 ---------------- 2025-05-07T20:32:44.8269609Z | Traceback (most recent call last): 2025-05-07T20:32:44.8270611Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:44.8271615Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8272029Z | ^^^^^^^^ 2025-05-07T20:32:44.8272946Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:44.8273932Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8274407Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8275535Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:44.8276664Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8277517Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:44.8278531Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8279151Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8280143Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:44.8281280Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8281958Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8282895Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 186, in 2025-05-07T20:32:44.8284022Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8284672Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8285571Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:44.8286641Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8287286Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8288121Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:44.8288925Z | fn() 2025-05-07T20:32:44.8289736Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:44.8290627Z | self.fn.run( 2025-05-07T20:32:44.8291375Z | File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:44.8292208Z | kernel = self.compile( 2025-05-07T20:32:44.8292583Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:44.8293420Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:44.8294422Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8294972Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8295871Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.8297029Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8297700Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.8298235Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8298721Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8299092Z | ^ 2025-05-07T20:32:44.8299744Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8300550Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:44.8301106Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:44.8301821Z | self=, 2025-05-07T20:32:44.8302423Z | T=1, # or any other generated value 2025-05-07T20:32:44.8302849Z | D=5120, # or any other generated value 2025-05-07T20:32:44.8303337Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:44.8303847Z | contiguous=True, # or any other generated value 2025-05-07T20:32:44.8304359Z | compiled=True, # or any other generated value 2025-05-07T20:32:44.8304779Z | ) 2025-05-07T20:32:44.8305039Z | 2025-05-07T20:32:44.8305780Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:44.8306640Z +------------------------------------ 2025-05-07T20:32:44.8307245Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:44.8307775Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8308399Z self=, 2025-05-07T20:32:44.8308970Z T=1, 2025-05-07T20:32:44.8309354Z D=5120, 2025-05-07T20:32:44.8309620Z scale_ub=None, 2025-05-07T20:32:44.8309921Z contiguous=True, 2025-05-07T20:32:44.8310234Z compiled=True, 2025-05-07T20:32:44.8310518Z ) 2025-05-07T20:32:44.8310950Z self = 2025-05-07T20:32:44.8311613Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8311974Z 2025-05-07T20:32:44.8312087Z @given( 2025-05-07T20:32:44.8312394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8312812Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8313276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8313725Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8314180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8314637Z ) 2025-05-07T20:32:44.8315116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8315734Z def test_silu_mul_quant( 2025-05-07T20:32:44.8316078Z self, 2025-05-07T20:32:44.8316355Z T: int, 2025-05-07T20:32:44.8316631Z D: int, 2025-05-07T20:32:44.8316961Z scale_ub: Optional[float], 2025-05-07T20:32:44.8317373Z contiguous: 
bool, 2025-05-07T20:32:44.8317693Z compiled: bool, 2025-05-07T20:32:44.8317994Z ) -> None: 2025-05-07T20:32:44.8318277Z torch.manual_seed(2025) 2025-05-07T20:32:44.8318592Z 2025-05-07T20:32:44.8318963Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8319436Z 2025-05-07T20:32:44.8319714Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8320134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8320575Z x = x_sign * x_clamp 2025-05-07T20:32:44.8320910Z x0 = x[:, :D] 2025-05-07T20:32:44.8340781Z x1 = x[:, D:] 2025-05-07T20:32:44.8341115Z 2025-05-07T20:32:44.8341359Z if contiguous: 2025-05-07T20:32:44.8341660Z x0 = x0.contiguous() 2025-05-07T20:32:44.8341992Z x1 = x1.contiguous() 2025-05-07T20:32:44.8342302Z 2025-05-07T20:32:44.8342551Z if scale_ub is not None: 2025-05-07T20:32:44.8342900Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8343349Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8343768Z ) 2025-05-07T20:32:44.8344027Z else: 2025-05-07T20:32:44.8344294Z scale_ub_tensor = None 2025-05-07T20:32:44.8344634Z 2025-05-07T20:32:44.8344935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8345352Z op = silu_mul_quant 2025-05-07T20:32:44.8345680Z if compiled: 2025-05-07T20:32:44.8346007Z op = torch.compile(op) 2025-05-07T20:32:44.8346391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8346762Z 2025-05-07T20:32:44.8347058Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.8347450Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.8347847Z 2025-05-07T20:32:44.8348188Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8348642Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.8349152Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.8349595Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.8350089Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8350516Z 2025-05-07T20:32:44.8350801Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:44.8351084Z 2025-05-07T20:32:44.8351235Z moe/activation_test.py:126: 2025-05-07T20:32:44.8351832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8352297Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.8352842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.8353942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.8354992Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.8355744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8356729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8357697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.8358717Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8359868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 2025-05-07T20:32:44.8361009Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.8362007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.8362895Z return 
self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.8363733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.8364463Z fn() 2025-05-07T20:32:44.8365161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.8365952Z self.fn.run( 2025-05-07T20:32:44.8366574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8367285Z kernel = self.compile( 2025-05-07T20:32:44.8368004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8368876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8369409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8369716Z 2025-05-07T20:32:44.8369988Z self = 2025-05-07T20:32:44.8371442Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8373307Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9831ba1260>} 2025-05-07T20:32:44.8375167Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8376649Z context = 2025-05-07T20:32:44.8377069Z 2025-05-07T20:32:44.8377290Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8378010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8378678Z module_map=module_map) 2025-05-07T20:32:44.8379163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8379639Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.8379997Z E ^ 2025-05-07T20:32:44.8380615Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8381285Z 2025-05-07T20:32:44.8381840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8382542Z 2025-05-07T20:32:44.8382725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8383278Z self=, 2025-05-07T20:32:44.8383807Z T=2048, 2025-05-07T20:32:44.8384060Z D=5120, 2025-05-07T20:32:44.8384320Z scale_ub=1200.0, 2025-05-07T20:32:44.8384604Z contiguous=True, 2025-05-07T20:32:44.8384903Z compiled=False, 2025-05-07T20:32:44.8385179Z ) 2025-05-07T20:32:44.8385598Z self = 2025-05-07T20:32:44.8386252Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.8386627Z 2025-05-07T20:32:44.8386750Z @given( 2025-05-07T20:32:44.8387091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8387556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8387963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8388472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8388906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8389394Z ) 2025-05-07T20:32:44.8389861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8390448Z def test_silu_mul_quant( 2025-05-07T20:32:44.8390761Z self, 2025-05-07T20:32:44.8391021Z T: int, 2025-05-07T20:32:44.8391287Z D: int, 2025-05-07T20:32:44.8391570Z scale_ub: Optional[float], 2025-05-07T20:32:44.8391929Z contiguous: bool, 2025-05-07T20:32:44.8392245Z compiled: bool, 2025-05-07T20:32:44.8392535Z ) -> None: 2025-05-07T20:32:44.8392824Z torch.manual_seed(2025) 2025-05-07T20:32:44.8393147Z 2025-05-07T20:32:44.8393505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8393964Z 2025-05-07T20:32:44.8394225Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8394603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8395021Z x = x_sign * x_clamp 2025-05-07T20:32:44.8395343Z x0 = x[:, :D] 2025-05-07T20:32:44.8395625Z x1 = x[:, D:] 2025-05-07T20:32:44.8395901Z 2025-05-07T20:32:44.8396165Z if contiguous: 2025-05-07T20:32:44.8396499Z x0 = x0.contiguous() 2025-05-07T20:32:44.8396865Z x1 = x1.contiguous() 2025-05-07T20:32:44.8397202Z 2025-05-07T20:32:44.8397463Z if scale_ub is not None: 2025-05-07T20:32:44.8397843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8398309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8398727Z ) 2025-05-07T20:32:44.8398995Z else: 2025-05-07T20:32:44.8399297Z scale_ub_tensor = None 2025-05-07T20:32:44.8399656Z 2025-05-07T20:32:44.8399958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8400383Z op = silu_mul_quant 2025-05-07T20:32:44.8400726Z if compiled: 2025-05-07T20:32:44.8401053Z op = torch.compile(op) 2025-05-07T20:32:44.8401453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8401829Z 2025-05-07T20:32:44.8402079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8402305Z 2025-05-07T20:32:44.8402436Z moe/activation_test.py:117: 2025-05-07T20:32:44.8402854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8403328Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8403714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8404648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8405601Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
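fp8e4nv is Triton's name for the NVIDIA E4M3 float8 format (torch.float8_e4m3fn), and Triton only generates conversions for it on GPUs with compute capability 8.9 or newer (Ada/Hopper); on older CUDA devices only fp8e4b15 and fp8e5 are available, which is exactly the ValueError above. A minimal sketch of that gate, assuming a CUDA-enabled PyTorch build:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv == torch.float8_e4m3fn. Triton emits conversions for it only on
    # compute capability >= (8, 9); Ampere parts report (8, 0) or (8, 6) and
    # hit the ValueError seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)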
Trying example: test_silu_mul_quant(
    T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True,
)
    (test source identical to the listing above)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
Each remaining example repeats the identical test source and one of the two traceback shapes shown above, differing only in the drawn parameters and in which kernel fails to compile:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (moe/activation_test.py:126, compiling _kernel_quantize_fp8_row)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
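The repeated "Trying example:" dumps come from @settings(verbosity=Verbosity.verbose, ...) on the test: verbose Hypothesis echoes every drawn example before executing it, so a single root cause fans out into one dump per (T, D, scale_ub, contiguous, compiled) draw. A minimal sketch of that pattern; max_examples=16 here is an assumption standing in for the module's _MAX_SAMPLES constant:

from hypothesis import Verbosity, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def test_demo(T: int) -> None:
    # Verbose mode prints "Trying example: test_demo(T=...)" before each run,
    # which is what produces the repeated blocks in this log.
    assert T in (1, 128, 2048, 4096, 16384)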
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()   (moe/activation_test.py:126, compiling _kernel_quantize_fp8_row)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()   (moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant)
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
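For reference, the computation both paths implement: fn() fuses y = x0 * sigmoid(x0) * x1 with rowwise FP8 quantization in one Triton kernel, while ref_fn() computes the product in fp32 and quantizes it unfused through triton_quantize_fp8_row. A plain-PyTorch sketch of the rowwise recipe, assuming E4M3 as the target dtype; the eps floor and where the scale_ub cap applies are assumptions, not FBGEMM's exact kernel:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3 ("fp8e4nv")

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row dequantization scale: row_max / FP8_MAX, optionally capped by
    # scale_ub and floored by a small eps (both details assumed here).
    row_max = x.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_MAX).clamp(min=1e-12)
    xq = (x / scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Dequantize as in the test: xq.to(torch.float32) * scale[:, None]
    return xq, scale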
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f9807db2fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
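Every example here fails at Triton compile time rather than in the test logic: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant cast to fp8e4nv (Triton's name for FP8 E4M3), and Triton only implements that dtype on GPUs with compute capability 8.9 or newer; older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper and class names are illustrative, not part of the FBGEMM test suite:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device can compile Triton fp8e4nv (E4M3) kernels."""
    if not torch.cuda.is_available():
        return False
    # Triton's fp8e4nv requires SM 8.9+ (Ada/Hopper); earlier parts only
    # expose fp8e4b15 and fp8e5, which is what the ValueError above lists.
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantGuardedTests(unittest.TestCase):
    ...

With Hypothesis in the mix, the same predicate could instead feed hypothesis.assume() inside the test body, so unsupported parameter draws are discarded rather than reported as failures.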
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
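For reference, the quantity the failing ref_fn is trying to compute is a rowwise FP8 quantization: each row is scaled so its absolute max maps into the E4M3 range, and the per-row dequantization scale is returned alongside the FP8 tensor (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]). A plain-PyTorch sketch of that contract, assuming E4M3's finite max of 448 and an eps clamp chosen here for illustration; this is not FBGEMM's exact triton_quantize_fp8_row implementation:

import torch

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def quantize_fp8_row_ref(y, scale_ub=None):
    # Per-row dequantization scale: row_max / fp8_max, optionally capped by
    # scale_ub, mirroring how the test dequantizes with y_scale[:, None].
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (
        (y.float() / scale[:, None])
        .clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)  # keep rounding inside the E4M3 range
        .to(torch.float8_e4m3fn)
    )
    return y_fp8, scale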
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:126 -> ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117 -> fn -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid]
E       triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
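The fused op under test composes the same quantization with a SiLU-gated product: silu(x0) * x1 computed in fp32, then quantized rowwise, which is exactly what the test's ref_fn builds out of torch.sigmoid and triton_quantize_fp8_row. A hedged eager-mode equivalent, reusing the quantize_fp8_row_ref sketch above rather than FBGEMM's fused Triton kernel:

import torch


def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x) = x * sigmoid(x); gate x1 by silu(x0) in fp32, then apply the
    # rowwise FP8 quantization sketched earlier.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)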
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8789789Z 2025-05-07T20:32:44.8790207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8790260Z 2025-05-07T20:32:44.8790368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8790638Z self=, 2025-05-07T20:32:44.8790719Z T=128, 2025-05-07T20:32:44.8790795Z D=5120, 2025-05-07T20:32:44.8790885Z scale_ub=None, 2025-05-07T20:32:44.8790975Z contiguous=False, 2025-05-07T20:32:44.8791062Z compiled=True, 2025-05-07T20:32:44.8791135Z ) 2025-05-07T20:32:44.8791354Z self = 2025-05-07T20:32:44.8791528Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8791532Z 2025-05-07T20:32:44.8791608Z @given( 2025-05-07T20:32:44.8791727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8791830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8791990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8792108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8792293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8792368Z ) 2025-05-07T20:32:44.8792624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8792716Z def test_silu_mul_quant( 2025-05-07T20:32:44.8792791Z self, 2025-05-07T20:32:44.8792873Z T: int, 2025-05-07T20:32:44.8792948Z D: int, 2025-05-07T20:32:44.8793045Z scale_ub: Optional[float], 2025-05-07T20:32:44.8793140Z contiguous: bool, 2025-05-07T20:32:44.8793224Z compiled: bool, 2025-05-07T20:32:44.8793301Z ) -> None: 2025-05-07T20:32:44.8793401Z torch.manual_seed(2025) 2025-05-07T20:32:44.8793473Z 2025-05-07T20:32:44.8793640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8793722Z 2025-05-07T20:32:44.8793813Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8793943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8794031Z x = x_sign * x_clamp 2025-05-07T20:32:44.8794114Z x0 = x[:, :D] 2025-05-07T20:32:44.8794202Z x1 = x[:, D:] 2025-05-07T20:32:44.8794273Z 2025-05-07T20:32:44.8794356Z if contiguous: 2025-05-07T20:32:44.8794452Z x0 = x0.contiguous() 2025-05-07T20:32:44.8794539Z x1 = x1.contiguous() 2025-05-07T20:32:44.8794611Z 2025-05-07T20:32:44.8794707Z if scale_ub is not None: 2025-05-07T20:32:44.8794812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8794944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8795025Z ) 2025-05-07T20:32:44.8795100Z else: 2025-05-07T20:32:44.8795200Z scale_ub_tensor = None 2025-05-07T20:32:44.8795272Z 2025-05-07T20:32:44.8795401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8795501Z op = silu_mul_quant 2025-05-07T20:32:44.8795589Z if compiled: 2025-05-07T20:32:44.8795687Z op = torch.compile(op) 2025-05-07T20:32:44.8795800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8795873Z 2025-05-07T20:32:44.8795964Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8795969Z 2025-05-07T20:32:44.8796073Z moe/activation_test.py:117: 2025-05-07T20:32:44.8796200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8796300Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8796403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8796803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8796924Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8797419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8797570Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8797935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8798196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8798542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8798635Z kernel = self.compile( 2025-05-07T20:32:44.8799016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8799193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8799324Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8799328Z 2025-05-07T20:32:44.8799530Z self = 2025-05-07T20:32:44.8800403Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8800905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806dcf1a0>} 2025-05-07T20:32:44.8801670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8806329Z context = 2025-05-07T20:32:44.8806338Z 2025-05-07T20:32:44.8806521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8806786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8806913Z module_map=module_map) 2025-05-07T20:32:44.8807094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8807218Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8807315Z E ^ 2025-05-07T20:32:44.8807677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8807681Z 
2025-05-07T20:32:44.8808117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8808121Z 
2025-05-07T20:32:44.8808226Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:44.8808459Z     self=,
2025-05-07T20:32:44.8808537Z     T=128,
2025-05-07T20:32:44.8808616Z     D=7168,
2025-05-07T20:32:44.8808708Z     scale_ub=1200.0,
2025-05-07T20:32:44.8808801Z     contiguous=False,
2025-05-07T20:32:44.8808889Z     compiled=False,
2025-05-07T20:32:44.8808973Z )
2025-05-07T20:32:44.8809192Z self = 
2025-05-07T20:32:44.8809371Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:44.8809376Z 
2025-05-07T20:32:44.8809463Z     @given(
2025-05-07T20:32:44.8809585Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:44.8809693Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:44.8809810Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:44.8809929Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:44.8810050Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:44.8810128Z     )
2025-05-07T20:32:44.8810377Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:44.8810480Z     def test_silu_mul_quant(
2025-05-07T20:32:44.8810560Z         self,
2025-05-07T20:32:44.8810720Z         T: int,
2025-05-07T20:32:44.8810810Z         D: int,
2025-05-07T20:32:44.8810911Z         scale_ub: Optional[float],
2025-05-07T20:32:44.8811001Z         contiguous: bool,
2025-05-07T20:32:44.8811099Z         compiled: bool,
2025-05-07T20:32:44.8811220Z     ) -> None:
2025-05-07T20:32:44.8811326Z         torch.manual_seed(2025)
2025-05-07T20:32:44.8811401Z 
2025-05-07T20:32:44.8811572Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:44.8811653Z 
2025-05-07T20:32:44.8811750Z         x_sign = torch.sign(x)
2025-05-07T20:32:44.8811876Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:44.8811976Z         x = x_sign * x_clamp
2025-05-07T20:32:44.8812059Z         x0 = x[:, :D]
2025-05-07T20:32:44.8812140Z         x1 = x[:, D:]
2025-05-07T20:32:44.8812219Z 
2025-05-07T20:32:44.8812307Z         if contiguous:
2025-05-07T20:32:44.8812405Z             x0 = x0.contiguous()
2025-05-07T20:32:44.8812547Z             x1 = x1.contiguous()
2025-05-07T20:32:44.8812622Z 
2025-05-07T20:32:44.8812715Z         if scale_ub is not None:
2025-05-07T20:32:44.8812830Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:44.8813006Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:44.8813091Z             )
2025-05-07T20:32:44.8813168Z         else:
2025-05-07T20:32:44.8813262Z             scale_ub_tensor = None
2025-05-07T20:32:44.8813345Z 
2025-05-07T20:32:44.8813475Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8813566Z             op = silu_mul_quant
2025-05-07T20:32:44.8813662Z             if compiled:
2025-05-07T20:32:44.8813763Z                 op = torch.compile(op)
2025-05-07T20:32:44.8813869Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8813951Z 
2025-05-07T20:32:44.8814043Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:44.8814048Z 
2025-05-07T20:32:44.8814154Z moe/activation_test.py:117: 
2025-05-07T20:32:44.8814290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8814396Z moe/activation_test.py:115: in fn
2025-05-07T20:32:44.8814505Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:44.8815016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:44.8815115Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:44.8815488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.8815713Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8816061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8816157Z     kernel = self.compile(
2025-05-07T20:32:44.8816546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8816732Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8816864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8816868Z 
2025-05-07T20:32:44.8817075Z self = 
2025-05-07T20:32:44.8817869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8818377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98065307c0>}
2025-05-07T20:32:44.8819138Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.8819377Z context = 
2025-05-07T20:32:44.8819382Z 
2025-05-07T20:32:44.8819594Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8819857Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8819964Z                            module_map=module_map)
2025-05-07T20:32:44.8820136Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8820236Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:44.8820313Z E   ^
2025-05-07T20:32:44.8820678Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8820682Z 
2025-05-07T20:32:44.8821103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8821147Z 
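Context for the failures that follow: the Triton kernel requests the fp8e4nv (FP8 E4M3) element type, which Triton's NVIDIA backend accepts only on GPUs with compute capability 8.9 or newer. This job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G reports capability (8, 6), so only the 'fp8e4b15' and 'fp8e5' encodings are available and compilation fails before the kernel ever launches. Below is a minimal sketch of a capability guard such a test could use; the helper name supports_fp8e4nv and the skip placement are illustrative assumptions, not FBGEMM API.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (FP8 E4M3) codegen is assumed
        # to require compute capability 8.9+ (Ada/Hopper); the A10G on this
        # runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")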
2025-05-07T20:32:44.8821262Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant; identical test body and traceback omitted
2025-05-07T20:32:44.8834910Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8848035Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError; compiled=True runs additionally enter via torch/_dynamo/eval_frame.py:678 before the same kernel launch
2025-05-07T20:32:44.8861324Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8874750Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> here fn() returns and the failure moves to the reference path (test body as above):
2025-05-07T20:32:44.8881058Z         y_fp8, y_scale = fn()
2025-05-07T20:32:44.8881182Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:44.8881264Z 
2025-05-07T20:32:44.8881399Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:44.8881503Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:44.8881613Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:44.8881735Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:44.8881875Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8881957Z 
2025-05-07T20:32:44.8882058Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:44.8882063Z 
2025-05-07T20:32:44.8882162Z moe/activation_test.py:126: 
2025-05-07T20:32:44.8882297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8882404Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:44.8882548Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:44.8883116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:44.8883224Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:44.8883599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:44.8883825Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:44.8884192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:44.8884455Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8884855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:186: in 
2025-05-07T20:32:44.8885116Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:44.8885493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:44.8885737Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:44.8886127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:44.8886205Z     fn()
2025-05-07T20:32:44.8886632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:44.8886724Z     self.fn.run(
2025-05-07T20:32:44.8887089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:44.8887188Z     kernel = self.compile(
2025-05-07T20:32:44.8887573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:44.8887746Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:44.8887922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:44.8887929Z 
2025-05-07T20:32:44.8888133Z self = 
2025-05-07T20:32:44.8888962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:44.8889467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98063a5440>}
2025-05-07T20:32:44.8890229Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:44.8890418Z context = 
2025-05-07T20:32:44.8890426Z 
2025-05-07T20:32:44.8890592Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:44.8890862Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:44.8890971Z                            module_map=module_map)
2025-05-07T20:32:44.8891137Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:44.8891240Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:44.8891317Z E   ^
2025-05-07T20:32:44.8891679Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:44.8891684Z 
2025-05-07T20:32:44.8892100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:44.8892104Z 
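The reference path above fails identically because triton_quantize_fp8_row JIT-compiles _kernel_quantize_fp8_row during autotuning, and that kernel also materializes the fp8e4nv type. To make the intended operation concrete, here is a hedged eager-PyTorch sketch of row-wise FP8 quantization consistent with the test's dequantization line y = y_fp8.to(torch.float32) * y_scale[:, None]; it is an illustration under stated assumptions, not FBGEMM's implementation, and the scale_ub handling (capping the per-row max) is assumed.

    import torch

    def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row symmetric quantization to FP8 E4M3: each row is scaled so its
        # max-abs value maps to the FP8 max, and the per-row dequant multiplier
        # is returned alongside the quantized tensor.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed: cap row max at scale_ub
        y_scale = row_max / fp8_max                     # dequant multiplier per row
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale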
2025-05-07T20:32:44.8892205Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant; identical test body and traceback omitted
2025-05-07T20:32:44.8905586Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8918434Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8937932Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:44.8951592Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:44.8964613Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> enters silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80; same CompilationError
2025-05-07T20:32:44.8971442Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8971802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8972038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8972387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8972489Z kernel = self.compile( 2025-05-07T20:32:44.8972873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8973059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8973190Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8973194Z 2025-05-07T20:32:44.8973398Z self = 2025-05-07T20:32:44.8974186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8974806Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655b87a60>} 2025-05-07T20:32:44.8975565Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8975755Z context = 2025-05-07T20:32:44.8975759Z 2025-05-07T20:32:44.8975930Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8976192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8976300Z module_map=module_map) 2025-05-07T20:32:44.8976468Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8976611Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8976689Z E ^ 2025-05-07T20:32:44.8977095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8977100Z 2025-05-07T20:32:44.8977571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8977576Z 2025-05-07T20:32:44.8977687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8977912Z self=, 2025-05-07T20:32:44.8977991Z T=16384, 2025-05-07T20:32:44.8978074Z D=7168, 2025-05-07T20:32:44.8978157Z scale_ub=None, 2025-05-07T20:32:44.8978243Z contiguous=True, 2025-05-07T20:32:44.8978335Z compiled=True, 2025-05-07T20:32:44.8978407Z ) 2025-05-07T20:32:44.8978625Z self = 2025-05-07T20:32:44.8978808Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.8978813Z 2025-05-07T20:32:44.8978892Z @given( 2025-05-07T20:32:44.8979016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8979119Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8979233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8979354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8979466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8979541Z ) 2025-05-07T20:32:44.8979794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8979888Z def test_silu_mul_quant( 2025-05-07T20:32:44.8979970Z self, 2025-05-07T20:32:44.8980046Z T: int, 2025-05-07T20:32:44.8980122Z D: int, 2025-05-07T20:32:44.8980227Z scale_ub: Optional[float], 2025-05-07T20:32:44.8980316Z contiguous: bool, 2025-05-07T20:32:44.8980406Z compiled: bool, 2025-05-07T20:32:44.8980489Z ) -> None: 2025-05-07T20:32:44.8980584Z torch.manual_seed(2025) 2025-05-07T20:32:44.8980656Z 2025-05-07T20:32:44.8980835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8980912Z 2025-05-07T20:32:44.8981007Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8981138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8981226Z x = x_sign * x_clamp 2025-05-07T20:32:44.8981307Z x0 = x[:, :D] 2025-05-07T20:32:44.8981396Z x1 = x[:, D:] 2025-05-07T20:32:44.8981469Z 2025-05-07T20:32:44.8981562Z if contiguous: 2025-05-07T20:32:44.8981654Z x0 = x0.contiguous() 2025-05-07T20:32:44.8981744Z x1 = x1.contiguous() 2025-05-07T20:32:44.8981821Z 2025-05-07T20:32:44.8981913Z if scale_ub is not None: 2025-05-07T20:32:44.8982019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8982163Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8982290Z ) 2025-05-07T20:32:44.8982369Z else: 2025-05-07T20:32:44.8982471Z scale_ub_tensor = None 2025-05-07T20:32:44.8982545Z 2025-05-07T20:32:44.8982713Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8982809Z op = silu_mul_quant 2025-05-07T20:32:44.8982894Z if compiled: 2025-05-07T20:32:44.8982998Z op = torch.compile(op) 2025-05-07T20:32:44.8983102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8983174Z 2025-05-07T20:32:44.8983272Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8983277Z 2025-05-07T20:32:44.8983374Z moe/activation_test.py:117: 2025-05-07T20:32:44.8983508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8983616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8983717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8984130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8984228Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8984760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8984865Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8985220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8985440Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8985785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8985877Z kernel = self.compile( 2025-05-07T20:32:44.8986263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8986440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.8986567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8986574Z 2025-05-07T20:32:44.8986789Z self = 2025-05-07T20:32:44.8987597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.8988131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1120>} 2025-05-07T20:32:44.8988885Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.8989186Z context = 2025-05-07T20:32:44.8989191Z 2025-05-07T20:32:44.8989366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.8989630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.8989745Z module_map=module_map) 2025-05-07T20:32:44.8989905Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.8990005Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.8990089Z E ^ 2025-05-07T20:32:44.8990448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.8990452Z 2025-05-07T20:32:44.8990870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.8990884Z 2025-05-07T20:32:44.8991039Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.8991266Z self=, 2025-05-07T20:32:44.8991352Z T=4096, 2025-05-07T20:32:44.8991430Z D=5120, 2025-05-07T20:32:44.8991553Z scale_ub=None, 2025-05-07T20:32:44.8991647Z contiguous=False, 2025-05-07T20:32:44.8991731Z compiled=True, 2025-05-07T20:32:44.8991806Z ) 2025-05-07T20:32:44.8992030Z self = 2025-05-07T20:32:44.8992204Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.8992208Z 2025-05-07T20:32:44.8992294Z @given( 2025-05-07T20:32:44.8992412Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.8992510Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.8992631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.8992748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992905Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.8992986Z ) 2025-05-07T20:32:44.8993269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.8993372Z def test_silu_mul_quant( 2025-05-07T20:32:44.8993456Z self, 2025-05-07T20:32:44.8993531Z T: int, 2025-05-07T20:32:44.8993608Z D: int, 2025-05-07T20:32:44.8993712Z scale_ub: Optional[float], 2025-05-07T20:32:44.8993802Z contiguous: bool, 2025-05-07T20:32:44.8993893Z compiled: bool, 2025-05-07T20:32:44.8993970Z ) -> None: 2025-05-07T20:32:44.8994064Z torch.manual_seed(2025) 2025-05-07T20:32:44.8994143Z 2025-05-07T20:32:44.8994309Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.8994382Z 2025-05-07T20:32:44.8994480Z x_sign = torch.sign(x) 2025-05-07T20:32:44.8994603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.8994699Z x = x_sign * x_clamp 2025-05-07T20:32:44.8994785Z x0 = x[:, :D] 2025-05-07T20:32:44.8994864Z x1 = x[:, D:] 2025-05-07T20:32:44.8994937Z 2025-05-07T20:32:44.8995027Z if contiguous: 2025-05-07T20:32:44.8995139Z x0 = x0.contiguous() 2025-05-07T20:32:44.8995237Z x1 = x1.contiguous() 2025-05-07T20:32:44.8995310Z 2025-05-07T20:32:44.8995401Z if scale_ub is not None: 2025-05-07T20:32:44.8995513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.8995648Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.8995730Z ) 2025-05-07T20:32:44.8995806Z else: 2025-05-07T20:32:44.8995901Z scale_ub_tensor = None 2025-05-07T20:32:44.8995982Z 2025-05-07T20:32:44.8996110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.8996202Z op = silu_mul_quant 2025-05-07T20:32:44.8996293Z if compiled: 2025-05-07T20:32:44.8996399Z op = torch.compile(op) 2025-05-07T20:32:44.8996505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8996587Z 2025-05-07T20:32:44.8996677Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.8996684Z 2025-05-07T20:32:44.8996792Z moe/activation_test.py:117: 2025-05-07T20:32:44.8996919Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.8997019Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.8997128Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.8997495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.8997588Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.8998087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.8998184Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.8998548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.8998818Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.8999236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.8999336Z kernel = self.compile( 2025-05-07T20:32:44.8999720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.8999894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9000028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9000033Z 2025-05-07T20:32:44.9000237Z self = 2025-05-07T20:32:44.9001026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9001613Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae1c60>} 2025-05-07T20:32:44.9002376Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9002564Z context = 2025-05-07T20:32:44.9002568Z 2025-05-07T20:32:44.9002730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9002995Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9003100Z module_map=module_map) 2025-05-07T20:32:44.9003270Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9003370Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9003446Z E ^ 2025-05-07T20:32:44.9003815Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9003819Z 2025-05-07T20:32:44.9004236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9004241Z 2025-05-07T20:32:44.9004343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9004575Z self=, 2025-05-07T20:32:44.9004652Z T=4096, 2025-05-07T20:32:44.9004736Z D=5120, 2025-05-07T20:32:44.9004818Z scale_ub=1200.0, 2025-05-07T20:32:44.9004904Z contiguous=False, 2025-05-07T20:32:44.9004994Z compiled=False, 2025-05-07T20:32:44.9005066Z ) 2025-05-07T20:32:44.9005288Z self = 2025-05-07T20:32:44.9005472Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9005477Z 2025-05-07T20:32:44.9005556Z @given( 2025-05-07T20:32:44.9005674Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9005781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9005893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9006018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9006131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9006203Z ) 2025-05-07T20:32:44.9006452Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9006546Z def test_silu_mul_quant( 2025-05-07T20:32:44.9006621Z self, 2025-05-07T20:32:44.9006702Z T: int, 2025-05-07T20:32:44.9006779Z D: int, 2025-05-07T20:32:44.9006881Z scale_ub: Optional[float], 2025-05-07T20:32:44.9007080Z contiguous: bool, 2025-05-07T20:32:44.9007172Z compiled: bool, 2025-05-07T20:32:44.9007268Z ) -> None: 2025-05-07T20:32:44.9007372Z torch.manual_seed(2025) 2025-05-07T20:32:44.9007445Z 2025-05-07T20:32:44.9007661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9007733Z 2025-05-07T20:32:44.9007825Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9007956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9008043Z x = x_sign * x_clamp 2025-05-07T20:32:44.9008124Z x0 = x[:, :D] 2025-05-07T20:32:44.9008211Z x1 = x[:, D:] 2025-05-07T20:32:44.9008284Z 2025-05-07T20:32:44.9008366Z if contiguous: 2025-05-07T20:32:44.9008467Z x0 = x0.contiguous() 2025-05-07T20:32:44.9008561Z x1 = x1.contiguous() 2025-05-07T20:32:44.9008634Z 2025-05-07T20:32:44.9008730Z if scale_ub is not None: 2025-05-07T20:32:44.9008876Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9009022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9009097Z ) 2025-05-07T20:32:44.9009209Z else: 2025-05-07T20:32:44.9009312Z scale_ub_tensor = None 2025-05-07T20:32:44.9009383Z 2025-05-07T20:32:44.9009513Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9009610Z op = silu_mul_quant 2025-05-07T20:32:44.9009694Z if compiled: 2025-05-07T20:32:44.9009795Z op = torch.compile(op) 2025-05-07T20:32:44.9009907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9009979Z 2025-05-07T20:32:44.9010069Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9010073Z 2025-05-07T20:32:44.9010177Z moe/activation_test.py:117: 2025-05-07T20:32:44.9010305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9010413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9010517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9011019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9011125Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9011481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9011702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9012048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9012143Z kernel = self.compile( 2025-05-07T20:32:44.9012532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9012704Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9012834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9012841Z 2025-05-07T20:32:44.9013052Z self = 2025-05-07T20:32:44.9013838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9014349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9806ae3240>} 2025-05-07T20:32:44.9015101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9015290Z context = 2025-05-07T20:32:44.9015350Z 2025-05-07T20:32:44.9015515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9015779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9015937Z module_map=module_map) 2025-05-07T20:32:44.9016098Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9016198Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9016284Z E ^ 2025-05-07T20:32:44.9016643Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9016648Z 2025-05-07T20:32:44.9017069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9017074Z 2025-05-07T20:32:44.9017177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9017405Z self=, 2025-05-07T20:32:44.9017556Z T=4096, 2025-05-07T20:32:44.9017638Z D=5120, 2025-05-07T20:32:44.9017739Z scale_ub=1200.0, 2025-05-07T20:32:44.9017873Z contiguous=False, 2025-05-07T20:32:44.9017962Z compiled=True, 2025-05-07T20:32:44.9018034Z ) 2025-05-07T20:32:44.9018256Z self = 2025-05-07T20:32:44.9018429Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9018433Z 2025-05-07T20:32:44.9018515Z @given( 2025-05-07T20:32:44.9018632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9018730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9018849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9018966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9019078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9019161Z ) 2025-05-07T20:32:44.9019411Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9019509Z def test_silu_mul_quant( 2025-05-07T20:32:44.9019581Z self, 2025-05-07T20:32:44.9019661Z T: int, 2025-05-07T20:32:44.9019746Z D: int, 2025-05-07T20:32:44.9019844Z scale_ub: Optional[float], 2025-05-07T20:32:44.9019932Z contiguous: bool, 2025-05-07T20:32:44.9020024Z compiled: bool, 2025-05-07T20:32:44.9020100Z ) -> None: 2025-05-07T20:32:44.9020195Z torch.manual_seed(2025) 2025-05-07T20:32:44.9020273Z 2025-05-07T20:32:44.9020441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9020514Z 2025-05-07T20:32:44.9020612Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9020736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9020826Z x = x_sign * x_clamp 2025-05-07T20:32:44.9020913Z x0 = x[:, :D] 2025-05-07T20:32:44.9020996Z x1 = x[:, D:] 2025-05-07T20:32:44.9021076Z 2025-05-07T20:32:44.9021159Z if contiguous: 2025-05-07T20:32:44.9021250Z x0 = x0.contiguous() 2025-05-07T20:32:44.9021344Z x1 = x1.contiguous() 2025-05-07T20:32:44.9021417Z 2025-05-07T20:32:44.9021507Z if scale_ub is not None: 2025-05-07T20:32:44.9021620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9021752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9021829Z ) 2025-05-07T20:32:44.9021911Z else: 2025-05-07T20:32:44.9022004Z scale_ub_tensor = None 2025-05-07T20:32:44.9022077Z 2025-05-07T20:32:44.9022209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9022298Z op = silu_mul_quant 2025-05-07T20:32:44.9022391Z if compiled: 2025-05-07T20:32:44.9022491Z op = torch.compile(op) 2025-05-07T20:32:44.9022596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9022751Z 2025-05-07T20:32:44.9022842Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9022847Z 2025-05-07T20:32:44.9022944Z moe/activation_test.py:117: 2025-05-07T20:32:44.9023121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9023225Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9023326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9023700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9023792Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9024294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9024392Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9024748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9025016Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9025395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9025500Z kernel = self.compile( 2025-05-07T20:32:44.9025881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9026051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9026183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9026188Z 2025-05-07T20:32:44.9026390Z self = 2025-05-07T20:32:44.9027170Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9027684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664c720>} 2025-05-07T20:32:44.9028828Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9029079Z context = 2025-05-07T20:32:44.9029085Z 2025-05-07T20:32:44.9029248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9029516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9029623Z module_map=module_map) 2025-05-07T20:32:44.9029784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9029893Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9029972Z E ^ 2025-05-07T20:32:44.9030328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9030335Z 2025-05-07T20:32:44.9030760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9030764Z 2025-05-07T20:32:44.9030866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9031094Z self=, 2025-05-07T20:32:44.9031175Z T=2048, 2025-05-07T20:32:44.9031251Z D=7168, 2025-05-07T20:32:44.9031342Z scale_ub=1200.0, 2025-05-07T20:32:44.9031429Z contiguous=False, 2025-05-07T20:32:44.9031513Z compiled=False, 2025-05-07T20:32:44.9031591Z ) 2025-05-07T20:32:44.9031807Z self = 2025-05-07T20:32:44.9031988Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9032151Z 2025-05-07T20:32:44.9032232Z @given( 2025-05-07T20:32:44.9032349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9032530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9032646Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9032763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9032880Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9032954Z ) 2025-05-07T20:32:44.9033198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9033296Z def test_silu_mul_quant( 2025-05-07T20:32:44.9033371Z self, 2025-05-07T20:32:44.9033453Z T: int, 2025-05-07T20:32:44.9033529Z D: int, 2025-05-07T20:32:44.9033627Z scale_ub: Optional[float], 2025-05-07T20:32:44.9033720Z contiguous: bool, 2025-05-07T20:32:44.9033805Z compiled: bool, 2025-05-07T20:32:44.9033952Z ) -> None: 2025-05-07T20:32:44.9034054Z torch.manual_seed(2025) 2025-05-07T20:32:44.9034125Z 2025-05-07T20:32:44.9034387Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9034469Z 2025-05-07T20:32:44.9034560Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9034683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9034778Z x = x_sign * x_clamp 2025-05-07T20:32:44.9034857Z x0 = x[:, :D] 2025-05-07T20:32:44.9034938Z x1 = x[:, D:] 2025-05-07T20:32:44.9035015Z 2025-05-07T20:32:44.9035098Z if contiguous: 2025-05-07T20:32:44.9035193Z x0 = x0.contiguous() 2025-05-07T20:32:44.9035280Z x1 = x1.contiguous() 2025-05-07T20:32:44.9035351Z 2025-05-07T20:32:44.9035448Z if scale_ub is not None: 2025-05-07T20:32:44.9035553Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9035687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9035776Z ) 2025-05-07T20:32:44.9035851Z else: 2025-05-07T20:32:44.9035943Z scale_ub_tensor = None 2025-05-07T20:32:44.9036022Z 2025-05-07T20:32:44.9036154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9036244Z op = silu_mul_quant 2025-05-07T20:32:44.9036338Z if compiled: 2025-05-07T20:32:44.9036437Z op = torch.compile(op) 2025-05-07T20:32:44.9036552Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9036626Z 2025-05-07T20:32:44.9036715Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9036719Z 2025-05-07T20:32:44.9036827Z moe/activation_test.py:117: 2025-05-07T20:32:44.9036956Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9037079Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9037196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9037710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9037814Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9038175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9038396Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9038741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9038833Z kernel = self.compile( 2025-05-07T20:32:44.9039213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9039391Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9039520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9039526Z 2025-05-07T20:32:44.9039736Z self = 2025-05-07T20:32:44.9040612Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9041117Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664d580>} 2025-05-07T20:32:44.9041877Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9042067Z context = 2025-05-07T20:32:44.9042072Z 2025-05-07T20:32:44.9042242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9042545Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9042652Z module_map=module_map) 2025-05-07T20:32:44.9042858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9042961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9043048Z E ^ 2025-05-07T20:32:44.9043404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9043409Z 2025-05-07T20:32:44.9043825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9043829Z 2025-05-07T20:32:44.9043940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9044163Z self=, 2025-05-07T20:32:44.9044247Z T=1, 2025-05-07T20:32:44.9044325Z D=7168, 2025-05-07T20:32:44.9044410Z scale_ub=None, 2025-05-07T20:32:44.9044503Z contiguous=True, 2025-05-07T20:32:44.9044588Z compiled=False, 2025-05-07T20:32:44.9044660Z ) 2025-05-07T20:32:44.9044889Z self = 2025-05-07T20:32:44.9045053Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9045058Z 2025-05-07T20:32:44.9045134Z @given( 2025-05-07T20:32:44.9045260Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9045360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9045484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9045602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9045715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9045794Z ) 2025-05-07T20:32:44.9046040Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9046139Z def test_silu_mul_quant( 2025-05-07T20:32:44.9046226Z self, 2025-05-07T20:32:44.9046306Z T: int, 2025-05-07T20:32:44.9046382Z D: int, 2025-05-07T20:32:44.9046490Z scale_ub: Optional[float], 2025-05-07T20:32:44.9046582Z contiguous: bool, 2025-05-07T20:32:44.9046669Z compiled: bool, 2025-05-07T20:32:44.9046760Z ) -> None: 2025-05-07T20:32:44.9046855Z torch.manual_seed(2025) 2025-05-07T20:32:44.9046949Z 2025-05-07T20:32:44.9047142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9047225Z 2025-05-07T20:32:44.9047323Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9047446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9047534Z x = x_sign * x_clamp 2025-05-07T20:32:44.9047624Z x0 = x[:, :D] 2025-05-07T20:32:44.9047705Z x1 = x[:, D:] 2025-05-07T20:32:44.9047778Z 2025-05-07T20:32:44.9047866Z if contiguous: 2025-05-07T20:32:44.9047958Z x0 = x0.contiguous() 2025-05-07T20:32:44.9048097Z x1 = x1.contiguous() 2025-05-07T20:32:44.9048175Z 2025-05-07T20:32:44.9048265Z if scale_ub is not None: 2025-05-07T20:32:44.9048409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9048550Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9048625Z ) 2025-05-07T20:32:44.9048709Z else: 2025-05-07T20:32:44.9048802Z scale_ub_tensor = None 2025-05-07T20:32:44.9048873Z 2025-05-07T20:32:44.9049006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9049095Z op = silu_mul_quant 2025-05-07T20:32:44.9049179Z if compiled: 2025-05-07T20:32:44.9049285Z op = torch.compile(op) 2025-05-07T20:32:44.9049391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9049462Z 2025-05-07T20:32:44.9049559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9049607Z 2025-05-07T20:32:44.9049704Z moe/activation_test.py:117: 2025-05-07T20:32:44.9049840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9049942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9050106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9050613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9050710Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9051068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9051296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9051634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9051732Z kernel = self.compile( 2025-05-07T20:32:44.9052111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9052287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9052424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9052428Z 2025-05-07T20:32:44.9052634Z self = 2025-05-07T20:32:44.9053419Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9053919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664cea0>} 2025-05-07T20:32:44.9054671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9054870Z context = 2025-05-07T20:32:44.9054874Z 2025-05-07T20:32:44.9055040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9055308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9055414Z module_map=module_map) 2025-05-07T20:32:44.9055574Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9055679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9055755Z E ^ 2025-05-07T20:32:44.9056111Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9056121Z 2025-05-07T20:32:44.9056542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9056592Z 2025-05-07T20:32:44.9056696Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9056933Z self=, 2025-05-07T20:32:44.9057052Z T=16384, 2025-05-07T20:32:44.9057130Z D=7168, 2025-05-07T20:32:44.9057218Z scale_ub=1200.0, 2025-05-07T20:32:44.9061351Z contiguous=False, 2025-05-07T20:32:44.9061462Z compiled=True, 2025-05-07T20:32:44.9061539Z ) 2025-05-07T20:32:44.9061772Z self = 2025-05-07T20:32:44.9061953Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9061957Z 2025-05-07T20:32:44.9062039Z @given( 2025-05-07T20:32:44.9062167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9062267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9062390Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9062587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9062702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9062787Z ) 2025-05-07T20:32:44.9063185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9063286Z def test_silu_mul_quant( 2025-05-07T20:32:44.9063372Z self, 2025-05-07T20:32:44.9063450Z T: int, 2025-05-07T20:32:44.9063527Z D: int, 2025-05-07T20:32:44.9063637Z scale_ub: Optional[float], 2025-05-07T20:32:44.9063727Z contiguous: bool, 2025-05-07T20:32:44.9063814Z compiled: bool, 2025-05-07T20:32:44.9063903Z ) -> None: 2025-05-07T20:32:44.9063999Z torch.manual_seed(2025) 2025-05-07T20:32:44.9064082Z 2025-05-07T20:32:44.9064250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9064327Z 2025-05-07T20:32:44.9064428Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9064560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9064651Z x = x_sign * x_clamp 2025-05-07T20:32:44.9064740Z x0 = x[:, :D] 2025-05-07T20:32:44.9064826Z x1 = x[:, D:] 2025-05-07T20:32:44.9064901Z 2025-05-07T20:32:44.9064995Z if contiguous: 2025-05-07T20:32:44.9065090Z x0 = x0.contiguous() 2025-05-07T20:32:44.9065180Z x1 = x1.contiguous() 2025-05-07T20:32:44.9065264Z 2025-05-07T20:32:44.9065355Z if scale_ub is not None: 2025-05-07T20:32:44.9065463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9065607Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9065689Z ) 2025-05-07T20:32:44.9065774Z else: 2025-05-07T20:32:44.9065870Z scale_ub_tensor = None 2025-05-07T20:32:44.9065945Z 2025-05-07T20:32:44.9066083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9066178Z op = silu_mul_quant 2025-05-07T20:32:44.9066267Z if compiled: 2025-05-07T20:32:44.9066379Z op = torch.compile(op) 2025-05-07T20:32:44.9066485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9066565Z 2025-05-07T20:32:44.9066685Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9066690Z 2025-05-07T20:32:44.9066801Z moe/activation_test.py:117: 2025-05-07T20:32:44.9066953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9067055Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9067156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9067539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9067634Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9068129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9068234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9068644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9068915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9069374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9069469Z kernel = self.compile( 2025-05-07T20:32:44.9069859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9070034Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9070163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9070174Z 2025-05-07T20:32:44.9070380Z self = 2025-05-07T20:32:44.9071277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9071796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f980664f9c0>} 2025-05-07T20:32:44.9072550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9072749Z context = 2025-05-07T20:32:44.9072753Z 2025-05-07T20:32:44.9072919Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9073181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9073303Z module_map=module_map) 2025-05-07T20:32:44.9073464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9073570Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9073651Z E ^ 2025-05-07T20:32:44.9074010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9074015Z 2025-05-07T20:32:44.9074436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9074440Z 2025-05-07T20:32:44.9074544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9074766Z self=, 2025-05-07T20:32:44.9074850Z T=1, 2025-05-07T20:32:44.9074929Z D=7168, 2025-05-07T20:32:44.9075022Z scale_ub=None, 2025-05-07T20:32:44.9075110Z contiguous=False, 2025-05-07T20:32:44.9075197Z compiled=False, 2025-05-07T20:32:44.9075282Z ) 2025-05-07T20:32:44.9075499Z self = 2025-05-07T20:32:44.9075668Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9075674Z 2025-05-07T20:32:44.9075759Z @given( 2025-05-07T20:32:44.9075877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9075975Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9076098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9076213Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9076333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9076409Z ) 2025-05-07T20:32:44.9076654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9076757Z def test_silu_mul_quant( 2025-05-07T20:32:44.9076838Z self, 2025-05-07T20:32:44.9076916Z T: int, 2025-05-07T20:32:44.9077007Z D: int, 2025-05-07T20:32:44.9077159Z scale_ub: Optional[float], 2025-05-07T20:32:44.9077251Z contiguous: bool, 2025-05-07T20:32:44.9077367Z compiled: bool, 2025-05-07T20:32:44.9077454Z ) -> None: 2025-05-07T20:32:44.9077612Z torch.manual_seed(2025) 2025-05-07T20:32:44.9077698Z 2025-05-07T20:32:44.9077870Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9077953Z 2025-05-07T20:32:44.9078050Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9078177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9078278Z x = x_sign * x_clamp 2025-05-07T20:32:44.9078360Z x0 = x[:, :D] 2025-05-07T20:32:44.9078444Z x1 = x[:, D:] 2025-05-07T20:32:44.9078525Z 2025-05-07T20:32:44.9078611Z if contiguous: 2025-05-07T20:32:44.9078705Z x0 = x0.contiguous() 2025-05-07T20:32:44.9078804Z x1 = x1.contiguous() 2025-05-07T20:32:44.9078923Z 2025-05-07T20:32:44.9079018Z if scale_ub is not None: 2025-05-07T20:32:44.9079133Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9079307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9079398Z ) 2025-05-07T20:32:44.9079476Z else: 2025-05-07T20:32:44.9079572Z scale_ub_tensor = None 2025-05-07T20:32:44.9079652Z 2025-05-07T20:32:44.9079783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9079873Z op = silu_mul_quant 2025-05-07T20:32:44.9079968Z if compiled: 2025-05-07T20:32:44.9080072Z op = torch.compile(op) 2025-05-07T20:32:44.9080178Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9080259Z 2025-05-07T20:32:44.9080353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9080357Z 2025-05-07T20:32:44.9080457Z moe/activation_test.py:117: 2025-05-07T20:32:44.9080597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9080707Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9080815Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9081322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9081420Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9081788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9082012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9082354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9082462Z kernel = self.compile( 2025-05-07T20:32:44.9082846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9083029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9083161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9083166Z 2025-05-07T20:32:44.9083373Z self = 2025-05-07T20:32:44.9084167Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9084671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5c860>} 2025-05-07T20:32:44.9085434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9085673Z context = 2025-05-07T20:32:44.9085678Z 2025-05-07T20:32:44.9085852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9086172Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9086281Z module_map=module_map) 2025-05-07T20:32:44.9086452Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9086554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9086632Z E ^ 2025-05-07T20:32:44.9087048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9087053Z 2025-05-07T20:32:44.9087469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9087473Z 2025-05-07T20:32:44.9087624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9087850Z self=, 2025-05-07T20:32:44.9087929Z T=2048, 2025-05-07T20:32:44.9088015Z D=7168, 2025-05-07T20:32:44.9088136Z scale_ub=None, 2025-05-07T20:32:44.9088228Z contiguous=False, 2025-05-07T20:32:44.9088322Z compiled=True, 2025-05-07T20:32:44.9088395Z ) 2025-05-07T20:32:44.9088615Z self = 2025-05-07T20:32:44.9088794Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9088799Z 2025-05-07T20:32:44.9088877Z @given( 2025-05-07T20:32:44.9089000Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9089100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9089218Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9089344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9089460Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9089539Z ) 2025-05-07T20:32:44.9089794Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9089891Z def test_silu_mul_quant( 2025-05-07T20:32:44.9089978Z self, 2025-05-07T20:32:44.9090055Z T: int, 2025-05-07T20:32:44.9090133Z D: int, 2025-05-07T20:32:44.9090238Z scale_ub: Optional[float], 2025-05-07T20:32:44.9090330Z contiguous: bool, 2025-05-07T20:32:44.9090416Z compiled: bool, 2025-05-07T20:32:44.9090500Z ) -> None: 2025-05-07T20:32:44.9090597Z torch.manual_seed(2025) 2025-05-07T20:32:44.9090670Z 2025-05-07T20:32:44.9090844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9090920Z 2025-05-07T20:32:44.9091016Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9091147Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9091238Z x = x_sign * x_clamp 2025-05-07T20:32:44.9091324Z x0 = x[:, :D] 2025-05-07T20:32:44.9091413Z x1 = x[:, D:] 2025-05-07T20:32:44.9091488Z 2025-05-07T20:32:44.9091581Z if contiguous: 2025-05-07T20:32:44.9091676Z x0 = x0.contiguous() 2025-05-07T20:32:44.9091767Z x1 = x1.contiguous() 2025-05-07T20:32:44.9091849Z 2025-05-07T20:32:44.9091941Z if scale_ub is not None: 2025-05-07T20:32:44.9092049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9092191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9092269Z ) 2025-05-07T20:32:44.9092346Z else: 2025-05-07T20:32:44.9092449Z scale_ub_tensor = None 2025-05-07T20:32:44.9092523Z 2025-05-07T20:32:44.9092657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9092758Z op = silu_mul_quant 2025-05-07T20:32:44.9092844Z if compiled: 2025-05-07T20:32:44.9092953Z op = torch.compile(op) 2025-05-07T20:32:44.9093114Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9093188Z 2025-05-07T20:32:44.9093290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9093295Z 2025-05-07T20:32:44.9093397Z moe/activation_test.py:117: 2025-05-07T20:32:44.9093568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9093681Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9093782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9094152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9094255Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9094750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9094857Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9095215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9095480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9095872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9095969Z kernel = self.compile( 2025-05-07T20:32:44.9096360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9096535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9096667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9096671Z 2025-05-07T20:32:44.9096883Z self = 2025-05-07T20:32:44.9097720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9098241Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5dbc0>} 2025-05-07T20:32:44.9098999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9099189Z context = 2025-05-07T20:32:44.9099193Z 2025-05-07T20:32:44.9099363Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9099628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9099748Z module_map=module_map) 2025-05-07T20:32:44.9099911Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9100019Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9100106Z E ^ 2025-05-07T20:32:44.9100470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9100475Z 2025-05-07T20:32:44.9100907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9100911Z 2025-05-07T20:32:44.9101015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9101248Z self=, 2025-05-07T20:32:44.9101325Z T=4096, 2025-05-07T20:32:44.9101402Z D=7168, 2025-05-07T20:32:44.9101489Z scale_ub=None, 2025-05-07T20:32:44.9101577Z contiguous=False, 2025-05-07T20:32:44.9101661Z compiled=True, 2025-05-07T20:32:44.9101739Z ) 2025-05-07T20:32:44.9101959Z self = 2025-05-07T20:32:44.9102179Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9102183Z 2025-05-07T20:32:44.9102267Z @given( 2025-05-07T20:32:44.9102458Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9102558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9102678Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9102796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9102915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9102990Z ) 2025-05-07T20:32:44.9103235Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9103337Z def test_silu_mul_quant( 2025-05-07T20:32:44.9103414Z self, 2025-05-07T20:32:44.9103492Z T: int, 2025-05-07T20:32:44.9103576Z D: int, 2025-05-07T20:32:44.9103674Z scale_ub: Optional[float], 2025-05-07T20:32:44.9103809Z contiguous: bool, 2025-05-07T20:32:44.9103903Z compiled: bool, 2025-05-07T20:32:44.9103985Z ) -> None: 2025-05-07T20:32:44.9104086Z torch.manual_seed(2025) 2025-05-07T20:32:44.9104159Z 2025-05-07T20:32:44.9104370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9104450Z 2025-05-07T20:32:44.9104542Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9104666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9104760Z x = x_sign * x_clamp 2025-05-07T20:32:44.9104841Z x0 = x[:, :D] 2025-05-07T20:32:44.9104924Z x1 = x[:, D:] 2025-05-07T20:32:44.9105002Z 2025-05-07T20:32:44.9105086Z if contiguous: 2025-05-07T20:32:44.9105177Z x0 = x0.contiguous() 2025-05-07T20:32:44.9105272Z x1 = x1.contiguous() 2025-05-07T20:32:44.9105348Z 2025-05-07T20:32:44.9105441Z if scale_ub is not None: 2025-05-07T20:32:44.9105552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9105691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9105773Z ) 2025-05-07T20:32:44.9105850Z else: 2025-05-07T20:32:44.9105946Z scale_ub_tensor = None 2025-05-07T20:32:44.9106028Z 2025-05-07T20:32:44.9106161Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9106251Z op = silu_mul_quant 2025-05-07T20:32:44.9106344Z if compiled: 2025-05-07T20:32:44.9106444Z op = torch.compile(op) 2025-05-07T20:32:44.9106548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9106632Z 2025-05-07T20:32:44.9106722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9106727Z 2025-05-07T20:32:44.9106831Z moe/activation_test.py:117: 2025-05-07T20:32:44.9106961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9107063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9107171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9107579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9107689Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9108192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9108288Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9108648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9108870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9109329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9109429Z kernel = self.compile( 2025-05-07T20:32:44.9109810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9110038Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9110174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9110178Z 2025-05-07T20:32:44.9110420Z self = 2025-05-07T20:32:44.9111214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9111716Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5e700>} 2025-05-07T20:32:44.9112474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9112706Z context = 2025-05-07T20:32:44.9112711Z 2025-05-07T20:32:44.9112913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9113183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9113289Z module_map=module_map) 2025-05-07T20:32:44.9113451Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9113556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9113633Z E ^ 2025-05-07T20:32:44.9113998Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9114002Z 2025-05-07T20:32:44.9114419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9114427Z 2025-05-07T20:32:44.9114532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9114762Z self=, 2025-05-07T20:32:44.9114842Z T=16384, 2025-05-07T20:32:44.9114927Z D=5120, 2025-05-07T20:32:44.9115012Z scale_ub=1200.0, 2025-05-07T20:32:44.9115100Z contiguous=False, 2025-05-07T20:32:44.9115193Z compiled=False, 2025-05-07T20:32:44.9115268Z ) 2025-05-07T20:32:44.9115486Z self = 2025-05-07T20:32:44.9115673Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9115677Z 2025-05-07T20:32:44.9115759Z @given( 2025-05-07T20:32:44.9115877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9115982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9116098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9116223Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9116339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9116414Z ) 2025-05-07T20:32:44.9116669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9116789Z def test_silu_mul_quant( 2025-05-07T20:32:44.9116870Z self, 2025-05-07T20:32:44.9116974Z T: int, 2025-05-07T20:32:44.9117051Z D: int, 2025-05-07T20:32:44.9117152Z scale_ub: Optional[float], 2025-05-07T20:32:44.9117248Z contiguous: bool, 2025-05-07T20:32:44.9117336Z compiled: bool, 2025-05-07T20:32:44.9117415Z ) -> None: 2025-05-07T20:32:44.9117518Z torch.manual_seed(2025) 2025-05-07T20:32:44.9117590Z 2025-05-07T20:32:44.9117759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9117840Z 2025-05-07T20:32:44.9117933Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9118065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9118211Z x = x_sign * x_clamp 2025-05-07T20:32:44.9118291Z x0 = x[:, :D] 2025-05-07T20:32:44.9118378Z x1 = x[:, D:] 2025-05-07T20:32:44.9118451Z 2025-05-07T20:32:44.9118538Z if contiguous: 2025-05-07T20:32:44.9118675Z x0 = x0.contiguous() 2025-05-07T20:32:44.9118765Z x1 = x1.contiguous() 2025-05-07T20:32:44.9118836Z 2025-05-07T20:32:44.9118932Z if scale_ub is not None: 2025-05-07T20:32:44.9119037Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9119173Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9119254Z ) 2025-05-07T20:32:44.9119330Z else: 2025-05-07T20:32:44.9119430Z scale_ub_tensor = None 2025-05-07T20:32:44.9119501Z 2025-05-07T20:32:44.9119629Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9119727Z op = silu_mul_quant 2025-05-07T20:32:44.9119853Z if compiled: 2025-05-07T20:32:44.9119955Z op = torch.compile(op) 2025-05-07T20:32:44.9120066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9120149Z 2025-05-07T20:32:44.9120287Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9120295Z 2025-05-07T20:32:44.9120396Z moe/activation_test.py:117: 2025-05-07T20:32:44.9120526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9120635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9120737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9121243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9121339Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9121698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9121926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9122271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9122367Z kernel = self.compile( 2025-05-07T20:32:44.9122758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9122930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9123063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9123068Z 2025-05-07T20:32:44.9123270Z self = 2025-05-07T20:32:44.9124052Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9124568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655c5f060>} 2025-05-07T20:32:44.9125328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9125524Z context = 2025-05-07T20:32:44.9125529Z 2025-05-07T20:32:44.9125694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9125963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9126070Z module_map=module_map) 2025-05-07T20:32:44.9126231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9126337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9126418Z E ^ 2025-05-07T20:32:44.9126823Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9126827Z 2025-05-07T20:32:44.9127311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9127317Z 2025-05-07T20:32:44.9127434Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9127683Z self=, 2025-05-07T20:32:44.9127763Z T=16384, 2025-05-07T20:32:44.9127840Z D=5120, 2025-05-07T20:32:44.9127931Z scale_ub=1200.0, 2025-05-07T20:32:44.9128017Z contiguous=True, 2025-05-07T20:32:44.9128100Z compiled=True, 2025-05-07T20:32:44.9128467Z ) 2025-05-07T20:32:44.9128782Z self = 2025-05-07T20:32:44.9128977Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9129169Z 2025-05-07T20:32:44.9129247Z @given( 2025-05-07T20:32:44.9129371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9129479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9129672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9129800Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9129926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9130000Z ) 2025-05-07T20:32:44.9130286Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9130388Z def test_silu_mul_quant( 2025-05-07T20:32:44.9130465Z self, 2025-05-07T20:32:44.9130542Z T: int, 2025-05-07T20:32:44.9130626Z D: int, 2025-05-07T20:32:44.9130728Z scale_ub: Optional[float], 2025-05-07T20:32:44.9130825Z contiguous: bool, 2025-05-07T20:32:44.9130914Z compiled: bool, 2025-05-07T20:32:44.9130994Z ) -> None: 2025-05-07T20:32:44.9131100Z torch.manual_seed(2025) 2025-05-07T20:32:44.9131176Z 2025-05-07T20:32:44.9131360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9131440Z 2025-05-07T20:32:44.9131539Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9131670Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9131771Z x = x_sign * x_clamp 2025-05-07T20:32:44.9131854Z x0 = x[:, :D] 2025-05-07T20:32:44.9131934Z x1 = x[:, D:] 2025-05-07T20:32:44.9132013Z 2025-05-07T20:32:44.9132098Z if contiguous: 2025-05-07T20:32:44.9132191Z x0 = x0.contiguous() 2025-05-07T20:32:44.9132291Z x1 = x1.contiguous() 2025-05-07T20:32:44.9132364Z 2025-05-07T20:32:44.9132463Z if scale_ub is not None: 2025-05-07T20:32:44.9132573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9132716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9132802Z ) 2025-05-07T20:32:44.9132881Z else: 2025-05-07T20:32:44.9132977Z scale_ub_tensor = None 2025-05-07T20:32:44.9133055Z 2025-05-07T20:32:44.9133194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9133288Z op = silu_mul_quant 2025-05-07T20:32:44.9133382Z if compiled: 2025-05-07T20:32:44.9133486Z op = torch.compile(op) 2025-05-07T20:32:44.9133599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9133681Z 2025-05-07T20:32:44.9133775Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9133780Z 2025-05-07T20:32:44.9133886Z moe/activation_test.py:117: 2025-05-07T20:32:44.9134028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9134134Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9134245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9134687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9134902Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9135475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9135573Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9135937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9136158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9136497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9136601Z kernel = self.compile( 2025-05-07T20:32:44.9136983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9137161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9137339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9137345Z 2025-05-07T20:32:44.9137614Z self = 2025-05-07T20:32:44.9138430Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9138932Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e51c0>} 2025-05-07T20:32:44.9139691Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9139882Z context = 2025-05-07T20:32:44.9139891Z 2025-05-07T20:32:44.9140056Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9140328Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9140435Z module_map=module_map) 2025-05-07T20:32:44.9140602Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9140702Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9140779Z E ^ 2025-05-07T20:32:44.9141145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9141150Z 2025-05-07T20:32:44.9141568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9141572Z 2025-05-07T20:32:44.9141683Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9141911Z self=, 2025-05-07T20:32:44.9141990Z T=16384, 2025-05-07T20:32:44.9142073Z D=5120, 2025-05-07T20:32:44.9142157Z scale_ub=None, 2025-05-07T20:32:44.9142248Z contiguous=False, 2025-05-07T20:32:44.9142343Z compiled=True, 2025-05-07T20:32:44.9142416Z ) 2025-05-07T20:32:44.9142635Z self = 2025-05-07T20:32:44.9142819Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9142824Z 2025-05-07T20:32:44.9142902Z @given( 2025-05-07T20:32:44.9143028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9143128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9143244Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9143369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9143483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9143560Z ) 2025-05-07T20:32:44.9143863Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9143958Z def test_silu_mul_quant( 2025-05-07T20:32:44.9144036Z self, 2025-05-07T20:32:44.9144156Z T: int, 2025-05-07T20:32:44.9144234Z D: int, 2025-05-07T20:32:44.9144332Z scale_ub: Optional[float], 2025-05-07T20:32:44.9144425Z contiguous: bool, 2025-05-07T20:32:44.9144511Z compiled: bool, 2025-05-07T20:32:44.9144594Z ) -> None: 2025-05-07T20:32:44.9144688Z torch.manual_seed(2025) 2025-05-07T20:32:44.9144760Z 2025-05-07T20:32:44.9144935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9145009Z 2025-05-07T20:32:44.9145101Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9145230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9145319Z x = x_sign * x_clamp 2025-05-07T20:32:44.9145398Z x0 = x[:, :D] 2025-05-07T20:32:44.9145533Z x1 = x[:, D:] 2025-05-07T20:32:44.9145605Z 2025-05-07T20:32:44.9145689Z if contiguous: 2025-05-07T20:32:44.9145786Z x0 = x0.contiguous() 2025-05-07T20:32:44.9145913Z x1 = x1.contiguous() 2025-05-07T20:32:44.9145990Z 2025-05-07T20:32:44.9146086Z if scale_ub is not None: 2025-05-07T20:32:44.9146193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9146332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9146408Z ) 2025-05-07T20:32:44.9146483Z else: 2025-05-07T20:32:44.9146581Z scale_ub_tensor = None 2025-05-07T20:32:44.9146654Z 2025-05-07T20:32:44.9146783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9146880Z op = silu_mul_quant 2025-05-07T20:32:44.9146966Z if compiled: 2025-05-07T20:32:44.9147066Z op = torch.compile(op) 2025-05-07T20:32:44.9147177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9147256Z 2025-05-07T20:32:44.9147349Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9147358Z 2025-05-07T20:32:44.9147455Z moe/activation_test.py:117: 2025-05-07T20:32:44.9147588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9147695Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9147794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9148161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9148259Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9148752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9148853Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9149289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9149517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9149864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9149959Z kernel = self.compile( 2025-05-07T20:32:44.9150340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9150516Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9150644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9150649Z 2025-05-07T20:32:44.9150857Z self = 2025-05-07T20:32:44.9151639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9152198Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e5d00>} 2025-05-07T20:32:44.9152996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9153186Z context = 2025-05-07T20:32:44.9153191Z 2025-05-07T20:32:44.9153359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9153620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9153727Z module_map=module_map) 2025-05-07T20:32:44.9153895Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9154038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9154124Z E ^ 2025-05-07T20:32:44.9154520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9154525Z 2025-05-07T20:32:44.9154946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9154951Z 2025-05-07T20:32:44.9155061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9155285Z self=, 2025-05-07T20:32:44.9155369Z T=2048, 2025-05-07T20:32:44.9155445Z D=5120, 2025-05-07T20:32:44.9155527Z scale_ub=None, 2025-05-07T20:32:44.9155622Z contiguous=False, 2025-05-07T20:32:44.9155706Z compiled=True, 2025-05-07T20:32:44.9155778Z ) 2025-05-07T20:32:44.9156006Z self = 2025-05-07T20:32:44.9156182Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9156190Z 2025-05-07T20:32:44.9156268Z @given( 2025-05-07T20:32:44.9156392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9156497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9156620Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9156735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9156851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9156932Z ) 2025-05-07T20:32:44.9157215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9157323Z def test_silu_mul_quant( 2025-05-07T20:32:44.9157407Z self, 2025-05-07T20:32:44.9157483Z T: int, 2025-05-07T20:32:44.9157560Z D: int, 2025-05-07T20:32:44.9157665Z scale_ub: Optional[float], 2025-05-07T20:32:44.9157756Z contiguous: bool, 2025-05-07T20:32:44.9157846Z compiled: bool, 2025-05-07T20:32:44.9157936Z ) -> None: 2025-05-07T20:32:44.9158030Z torch.manual_seed(2025) 2025-05-07T20:32:44.9158110Z 2025-05-07T20:32:44.9158279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9158355Z 2025-05-07T20:32:44.9158453Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9158579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9158667Z x = x_sign * x_clamp 2025-05-07T20:32:44.9158755Z x0 = x[:, :D] 2025-05-07T20:32:44.9158837Z x1 = x[:, D:] 2025-05-07T20:32:44.9158910Z 2025-05-07T20:32:44.9159000Z if contiguous: 2025-05-07T20:32:44.9159092Z x0 = x0.contiguous() 2025-05-07T20:32:44.9159182Z x1 = x1.contiguous() 2025-05-07T20:32:44.9159259Z 2025-05-07T20:32:44.9159350Z if scale_ub is not None: 2025-05-07T20:32:44.9159457Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9159596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9159746Z ) 2025-05-07T20:32:44.9159827Z else: 2025-05-07T20:32:44.9159921Z scale_ub_tensor = None 2025-05-07T20:32:44.9159993Z 2025-05-07T20:32:44.9160170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9160262Z op = silu_mul_quant 2025-05-07T20:32:44.9160346Z if compiled: 2025-05-07T20:32:44.9160453Z op = torch.compile(op) 2025-05-07T20:32:44.9160560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9160633Z 2025-05-07T20:32:44.9160729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9160734Z 2025-05-07T20:32:44.9160831Z moe/activation_test.py:117: 2025-05-07T20:32:44.9160966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9161067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9161167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9161586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9161680Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9162242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9162349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9162704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9162933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9163270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9163363Z kernel = self.compile( 2025-05-07T20:32:44.9163749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9163923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9164051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9164061Z 2025-05-07T20:32:44.9164270Z self = 2025-05-07T20:32:44.9165050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9165557Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96559e5620>} 2025-05-07T20:32:44.9166309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9166509Z context = 2025-05-07T20:32:44.9166513Z 2025-05-07T20:32:44.9166678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9166946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9167058Z module_map=module_map) 2025-05-07T20:32:44.9167219Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9167321Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9167417Z E ^ 2025-05-07T20:32:44.9167811Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9167815Z 2025-05-07T20:32:44.9168237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9168241Z 2025-05-07T20:32:44.9168346Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9168614Z self=, 2025-05-07T20:32:44.9168700Z T=2048, 2025-05-07T20:32:44.9168777Z D=5120, 2025-05-07T20:32:44.9168907Z scale_ub=1200.0, 2025-05-07T20:32:44.9168995Z contiguous=False, 2025-05-07T20:32:44.9169080Z compiled=True, 2025-05-07T20:32:44.9169160Z ) 2025-05-07T20:32:44.9169378Z self = 2025-05-07T20:32:44.9169552Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9169556Z 2025-05-07T20:32:44.9169643Z @given( 2025-05-07T20:32:44.9169764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9169864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9169989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9170106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9170267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9170344Z ) 2025-05-07T20:32:44.9170590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9170736Z def test_silu_mul_quant( 2025-05-07T20:32:44.9170816Z self, 2025-05-07T20:32:44.9170892Z T: int, 2025-05-07T20:32:44.9170975Z D: int, 2025-05-07T20:32:44.9171075Z scale_ub: Optional[float], 2025-05-07T20:32:44.9171163Z contiguous: bool, 2025-05-07T20:32:44.9171255Z compiled: bool, 2025-05-07T20:32:44.9171334Z ) -> None: 2025-05-07T20:32:44.9171429Z torch.manual_seed(2025) 2025-05-07T20:32:44.9171508Z 2025-05-07T20:32:44.9171676Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9171757Z 2025-05-07T20:32:44.9171849Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9171973Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9172071Z x = x_sign * x_clamp 2025-05-07T20:32:44.9172156Z x0 = x[:, :D] 2025-05-07T20:32:44.9172236Z x1 = x[:, D:] 2025-05-07T20:32:44.9172317Z 2025-05-07T20:32:44.9172403Z if contiguous: 2025-05-07T20:32:44.9172497Z x0 = x0.contiguous() 2025-05-07T20:32:44.9172597Z x1 = x1.contiguous() 2025-05-07T20:32:44.9172668Z 2025-05-07T20:32:44.9172758Z if scale_ub is not None: 2025-05-07T20:32:44.9172873Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9173007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9173081Z ) 2025-05-07T20:32:44.9173165Z else: 2025-05-07T20:32:44.9173258Z scale_ub_tensor = None 2025-05-07T20:32:44.9173336Z 2025-05-07T20:32:44.9173465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9173554Z op = silu_mul_quant 2025-05-07T20:32:44.9173646Z if compiled: 2025-05-07T20:32:44.9173745Z op = torch.compile(op) 2025-05-07T20:32:44.9173854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9173934Z 2025-05-07T20:32:44.9174024Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9174029Z 2025-05-07T20:32:44.9174130Z moe/activation_test.py:117: 2025-05-07T20:32:44.9174263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9174363Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9174474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9174840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9174932Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9175431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9175526Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9175882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9176167Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9176546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9176646Z kernel = self.compile( 2025-05-07T20:32:44.9177026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9177200Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9177333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9177337Z 2025-05-07T20:32:44.9177539Z self = 2025-05-07T20:32:44.9178327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9178969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96558145e0>} 2025-05-07T20:32:44.9179722Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9179917Z context = 2025-05-07T20:32:44.9179921Z 2025-05-07T20:32:44.9180085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9180355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9180462Z module_map=module_map) 2025-05-07T20:32:44.9180631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9180739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9180816Z E ^ 2025-05-07T20:32:44.9181182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9181187Z 2025-05-07T20:32:44.9181603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9181607Z 2025-05-07T20:32:44.9181713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9181945Z self=, 2025-05-07T20:32:44.9182022Z T=4096, 2025-05-07T20:32:44.9182098Z D=5120, 2025-05-07T20:32:44.9182190Z scale_ub=1200.0, 2025-05-07T20:32:44.9182274Z contiguous=True, 2025-05-07T20:32:44.9182361Z compiled=True, 2025-05-07T20:32:44.9182434Z ) 2025-05-07T20:32:44.9182651Z self = 2025-05-07T20:32:44.9182832Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9182837Z 2025-05-07T20:32:44.9182914Z @given( 2025-05-07T20:32:44.9183036Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9183141Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9183255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9183371Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9183492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9187657Z ) 2025-05-07T20:32:44.9187932Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9188030Z def test_silu_mul_quant( 2025-05-07T20:32:44.9188117Z self, 2025-05-07T20:32:44.9188195Z T: int, 2025-05-07T20:32:44.9188273Z D: int, 2025-05-07T20:32:44.9188384Z scale_ub: Optional[float], 2025-05-07T20:32:44.9188483Z contiguous: bool, 2025-05-07T20:32:44.9188665Z compiled: bool, 2025-05-07T20:32:44.9188748Z ) -> None: 2025-05-07T20:32:44.9188845Z torch.manual_seed(2025) 2025-05-07T20:32:44.9188931Z 2025-05-07T20:32:44.9189271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9189350Z 2025-05-07T20:32:44.9189453Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9189579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9189671Z x = x_sign * x_clamp 2025-05-07T20:32:44.9189762Z x0 = x[:, :D] 2025-05-07T20:32:44.9189843Z x1 = x[:, D:] 2025-05-07T20:32:44.9189917Z 2025-05-07T20:32:44.9190010Z if contiguous: 2025-05-07T20:32:44.9190103Z x0 = x0.contiguous() 2025-05-07T20:32:44.9190194Z x1 = x1.contiguous() 2025-05-07T20:32:44.9190276Z 2025-05-07T20:32:44.9190370Z if scale_ub is not None: 2025-05-07T20:32:44.9190485Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9190669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9190749Z ) 2025-05-07T20:32:44.9190834Z else: 2025-05-07T20:32:44.9190971Z scale_ub_tensor = None 2025-05-07T20:32:44.9191048Z 2025-05-07T20:32:44.9191190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9191284Z op = silu_mul_quant 2025-05-07T20:32:44.9191370Z if compiled: 2025-05-07T20:32:44.9191480Z op = torch.compile(op) 2025-05-07T20:32:44.9191586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9191659Z 2025-05-07T20:32:44.9191760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9191765Z 2025-05-07T20:32:44.9191865Z moe/activation_test.py:117: 2025-05-07T20:32:44.9192006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9192111Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9192217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9192599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9192695Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9193194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9193300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9193660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9193890Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9194230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9194324Z kernel = self.compile( 2025-05-07T20:32:44.9194718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9194897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9195038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9195044Z 2025-05-07T20:32:44.9195250Z self = 2025-05-07T20:32:44.9196036Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9196548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655815120>} 2025-05-07T20:32:44.9197299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9197576Z context = 2025-05-07T20:32:44.9197581Z 2025-05-07T20:32:44.9197811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9198077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9198195Z module_map=module_map) 2025-05-07T20:32:44.9198358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9198465Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9198543Z E ^ 2025-05-07T20:32:44.9198902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9198907Z 2025-05-07T20:32:44.9199333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9199510Z 2025-05-07T20:32:44.9199615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9199845Z self=, 2025-05-07T20:32:44.9199963Z T=128, 2025-05-07T20:32:44.9200044Z D=5120, 2025-05-07T20:32:44.9200135Z scale_ub=1200.0, 2025-05-07T20:32:44.9200222Z contiguous=False, 2025-05-07T20:32:44.9200305Z compiled=True, 2025-05-07T20:32:44.9200385Z ) 2025-05-07T20:32:44.9200604Z self = 2025-05-07T20:32:44.9200776Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9200780Z 2025-05-07T20:32:44.9200869Z @given( 2025-05-07T20:32:44.9200986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9201086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9201206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9201326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9201447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9201521Z ) 2025-05-07T20:32:44.9201773Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9201874Z def test_silu_mul_quant( 2025-05-07T20:32:44.9201951Z self, 2025-05-07T20:32:44.9202028Z T: int, 2025-05-07T20:32:44.9202113Z D: int, 2025-05-07T20:32:44.9202211Z scale_ub: Optional[float], 2025-05-07T20:32:44.9202300Z contiguous: bool, 2025-05-07T20:32:44.9202394Z compiled: bool, 2025-05-07T20:32:44.9202473Z ) -> None: 2025-05-07T20:32:44.9202568Z torch.manual_seed(2025) 2025-05-07T20:32:44.9202648Z 2025-05-07T20:32:44.9202816Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9202898Z 2025-05-07T20:32:44.9202992Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9203117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9203220Z x = x_sign * x_clamp 2025-05-07T20:32:44.9203303Z x0 = x[:, :D] 2025-05-07T20:32:44.9203387Z x1 = x[:, D:] 2025-05-07T20:32:44.9203468Z 2025-05-07T20:32:44.9203559Z if contiguous: 2025-05-07T20:32:44.9203652Z x0 = x0.contiguous() 2025-05-07T20:32:44.9203755Z x1 = x1.contiguous() 2025-05-07T20:32:44.9203828Z 2025-05-07T20:32:44.9203919Z if scale_ub is not None: 2025-05-07T20:32:44.9204038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9204175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9204262Z ) 2025-05-07T20:32:44.9204339Z else: 2025-05-07T20:32:44.9204436Z scale_ub_tensor = None 2025-05-07T20:32:44.9204517Z 2025-05-07T20:32:44.9204647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9204739Z op = silu_mul_quant 2025-05-07T20:32:44.9204840Z if compiled: 2025-05-07T20:32:44.9204992Z op = torch.compile(op) 2025-05-07T20:32:44.9205099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9205182Z 2025-05-07T20:32:44.9205276Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9205319Z 2025-05-07T20:32:44.9205421Z moe/activation_test.py:117: 2025-05-07T20:32:44.9205557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9205660Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9205769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9206141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9206237Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9206744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9206862Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9207295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9207561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9207906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9208010Z kernel = self.compile( 2025-05-07T20:32:44.9208393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9208567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9208707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9208711Z 2025-05-07T20:32:44.9208916Z self = 2025-05-07T20:32:44.9209709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9210222Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655816340>} 2025-05-07T20:32:44.9210980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9211171Z context = 2025-05-07T20:32:44.9211175Z 2025-05-07T20:32:44.9211340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9211609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9211720Z module_map=module_map) 2025-05-07T20:32:44.9211887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9211994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9212074Z E ^ 2025-05-07T20:32:44.9212444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9212448Z 2025-05-07T20:32:44.9212866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9212870Z 2025-05-07T20:32:44.9212975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9213208Z self=, 2025-05-07T20:32:44.9213287Z T=16384, 2025-05-07T20:32:44.9213374Z D=7168, 2025-05-07T20:32:44.9213459Z scale_ub=1200.0, 2025-05-07T20:32:44.9213545Z contiguous=True, 2025-05-07T20:32:44.9213639Z compiled=True, 2025-05-07T20:32:44.9213715Z ) 2025-05-07T20:32:44.9213979Z self = 2025-05-07T20:32:44.9214161Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9214168Z 2025-05-07T20:32:44.9214286Z @given( 2025-05-07T20:32:44.9214406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9214516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9214631Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9214759Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9214873Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9214949Z ) 2025-05-07T20:32:44.9215202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9215297Z def test_silu_mul_quant( 2025-05-07T20:32:44.9215374Z self, 2025-05-07T20:32:44.9215458Z T: int, 2025-05-07T20:32:44.9215536Z D: int, 2025-05-07T20:32:44.9215679Z scale_ub: Optional[float], 2025-05-07T20:32:44.9215780Z contiguous: bool, 2025-05-07T20:32:44.9215868Z compiled: bool, 2025-05-07T20:32:44.9215947Z ) -> None: 2025-05-07T20:32:44.9216088Z torch.manual_seed(2025) 2025-05-07T20:32:44.9216165Z 2025-05-07T20:32:44.9216334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9216415Z 2025-05-07T20:32:44.9216507Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9216639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9216729Z x = x_sign * x_clamp 2025-05-07T20:32:44.9216810Z x0 = x[:, :D] 2025-05-07T20:32:44.9216898Z x1 = x[:, D:] 2025-05-07T20:32:44.9216971Z 2025-05-07T20:32:44.9217056Z if contiguous: 2025-05-07T20:32:44.9217154Z x0 = x0.contiguous() 2025-05-07T20:32:44.9217263Z x1 = x1.contiguous() 2025-05-07T20:32:44.9217343Z 2025-05-07T20:32:44.9217463Z if scale_ub is not None: 2025-05-07T20:32:44.9217577Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9217715Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9217802Z ) 2025-05-07T20:32:44.9217884Z else: 2025-05-07T20:32:44.9217988Z scale_ub_tensor = None 2025-05-07T20:32:44.9218064Z 2025-05-07T20:32:44.9218194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9218292Z op = silu_mul_quant 2025-05-07T20:32:44.9218380Z if compiled: 2025-05-07T20:32:44.9218483Z op = torch.compile(op) 2025-05-07T20:32:44.9218596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9218671Z 2025-05-07T20:32:44.9218762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9218767Z 2025-05-07T20:32:44.9218873Z moe/activation_test.py:117: 2025-05-07T20:32:44.9219005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9219122Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9219224Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9219596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9219699Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9220193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9220290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9220656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9220877Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9221226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9221322Z kernel = self.compile( 2025-05-07T20:32:44.9221707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9221937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9222115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9222119Z 2025-05-07T20:32:44.9222326Z self = 2025-05-07T20:32:44.9223124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9223629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655817c40>} 2025-05-07T20:32:44.9224394Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9224665Z context = 2025-05-07T20:32:44.9224676Z 2025-05-07T20:32:44.9224852Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9225116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9225228Z module_map=module_map) 2025-05-07T20:32:44.9225399Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9225500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9225579Z E ^ 2025-05-07T20:32:44.9225946Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9225950Z 2025-05-07T20:32:44.9226369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9226378Z 2025-05-07T20:32:44.9226490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9226716Z self=, 2025-05-07T20:32:44.9226794Z T=16384, 2025-05-07T20:32:44.9226878Z D=5120, 2025-05-07T20:32:44.9226962Z scale_ub=1200.0, 2025-05-07T20:32:44.9227049Z contiguous=True, 2025-05-07T20:32:44.9227142Z compiled=False, 2025-05-07T20:32:44.9227215Z ) 2025-05-07T20:32:44.9227441Z self = 2025-05-07T20:32:44.9227646Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9227654Z 2025-05-07T20:32:44.9227758Z @given( 2025-05-07T20:32:44.9227877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9227977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9228100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9228756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9228915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9229001Z ) 2025-05-07T20:32:44.9229296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9229391Z def test_silu_mul_quant( 2025-05-07T20:32:44.9229475Z self, 2025-05-07T20:32:44.9229554Z T: int, 2025-05-07T20:32:44.9229643Z D: int, 2025-05-07T20:32:44.9229742Z scale_ub: Optional[float], 2025-05-07T20:32:44.9229833Z contiguous: bool, 2025-05-07T20:32:44.9229926Z compiled: bool, 2025-05-07T20:32:44.9230005Z ) -> None: 2025-05-07T20:32:44.9230100Z torch.manual_seed(2025) 2025-05-07T20:32:44.9230182Z 2025-05-07T20:32:44.9230352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9230426Z 2025-05-07T20:32:44.9230528Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9230824Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9230916Z x = x_sign * x_clamp 2025-05-07T20:32:44.9231001Z x0 = x[:, :D] 2025-05-07T20:32:44.9231084Z x1 = x[:, D:] 2025-05-07T20:32:44.9231254Z 2025-05-07T20:32:44.9231346Z if contiguous: 2025-05-07T20:32:44.9231437Z x0 = x0.contiguous() 2025-05-07T20:32:44.9231532Z x1 = x1.contiguous() 2025-05-07T20:32:44.9231605Z 2025-05-07T20:32:44.9231693Z if scale_ub is not None: 2025-05-07T20:32:44.9231804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9231938Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9232013Z ) 2025-05-07T20:32:44.9232094Z else: 2025-05-07T20:32:44.9232186Z scale_ub_tensor = None 2025-05-07T20:32:44.9232257Z 2025-05-07T20:32:44.9232392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9232551Z op = silu_mul_quant 2025-05-07T20:32:44.9232638Z if compiled: 2025-05-07T20:32:44.9232742Z op = torch.compile(op) 2025-05-07T20:32:44.9232915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9232997Z 2025-05-07T20:32:44.9233087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9233092Z 2025-05-07T20:32:44.9233189Z moe/activation_test.py:117: 2025-05-07T20:32:44.9233323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9233422Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9233522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9234033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:44.9234128Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9234491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9234717Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9235058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9235159Z kernel = self.compile( 2025-05-07T20:32:44.9235540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9235711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9235848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9235853Z 2025-05-07T20:32:44.9236056Z self = 2025-05-07T20:32:44.9236845Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9237354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618ae0>} 2025-05-07T20:32:44.9238114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9238301Z context = 2025-05-07T20:32:44.9238305Z 2025-05-07T20:32:44.9238467Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9238732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9238838Z module_map=module_map) 2025-05-07T20:32:44.9239005Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9239151Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9239229Z E ^ 2025-05-07T20:32:44.9239594Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9239637Z 2025-05-07T20:32:44.9240056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9240061Z 2025-05-07T20:32:44.9240168Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9240396Z self=, 2025-05-07T20:32:44.9240473Z T=1, 2025-05-07T20:32:44.9240554Z D=7168, 2025-05-07T20:32:44.9240638Z scale_ub=1200.0, 2025-05-07T20:32:44.9240724Z contiguous=False, 2025-05-07T20:32:44.9240814Z compiled=False, 2025-05-07T20:32:44.9240886Z ) 2025-05-07T20:32:44.9241103Z self = 2025-05-07T20:32:44.9241320Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9241327Z 2025-05-07T20:32:44.9241403Z @given( 2025-05-07T20:32:44.9241559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9241668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9241781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9241904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9242018Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9242090Z ) 2025-05-07T20:32:44.9242342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9242436Z def test_silu_mul_quant( 2025-05-07T20:32:44.9242512Z self, 2025-05-07T20:32:44.9242596Z T: int, 2025-05-07T20:32:44.9242674Z D: int, 2025-05-07T20:32:44.9242771Z scale_ub: Optional[float], 2025-05-07T20:32:44.9242867Z contiguous: bool, 2025-05-07T20:32:44.9242959Z compiled: bool, 2025-05-07T20:32:44.9243044Z ) -> None: 2025-05-07T20:32:44.9243144Z torch.manual_seed(2025) 2025-05-07T20:32:44.9243218Z 2025-05-07T20:32:44.9243399Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9243471Z 2025-05-07T20:32:44.9243563Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9243696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9243784Z x = x_sign * x_clamp 2025-05-07T20:32:44.9243864Z x0 = x[:, :D] 2025-05-07T20:32:44.9243951Z x1 = x[:, D:] 2025-05-07T20:32:44.9244024Z 2025-05-07T20:32:44.9244109Z if contiguous: 2025-05-07T20:32:44.9244207Z x0 = x0.contiguous() 2025-05-07T20:32:44.9244295Z x1 = x1.contiguous() 2025-05-07T20:32:44.9244370Z 2025-05-07T20:32:44.9244469Z if scale_ub is not None: 2025-05-07T20:32:44.9244576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9244727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9244805Z ) 2025-05-07T20:32:44.9244883Z else: 2025-05-07T20:32:44.9244983Z scale_ub_tensor = None 2025-05-07T20:32:44.9245060Z 2025-05-07T20:32:44.9245192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9245289Z op = silu_mul_quant 2025-05-07T20:32:44.9245375Z if compiled: 2025-05-07T20:32:44.9245475Z op = torch.compile(op) 2025-05-07T20:32:44.9245589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9245660Z 2025-05-07T20:32:44.9245749Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9245753Z 2025-05-07T20:32:44.9245854Z moe/activation_test.py:117: 2025-05-07T20:32:44.9245984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9246090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9246189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9246690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9246847Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9247295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9247517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9247864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9247977Z kernel = self.compile( 2025-05-07T20:32:44.9248359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9248533Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9248670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9248715Z 2025-05-07T20:32:44.9248921Z self = 2025-05-07T20:32:44.9249745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9250259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655618400>} 2025-05-07T20:32:44.9251010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9251208Z context = 2025-05-07T20:32:44.9251213Z 2025-05-07T20:32:44.9251378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9251658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9251767Z module_map=module_map) 2025-05-07T20:32:44.9251930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9252035Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9252113Z E ^ 2025-05-07T20:32:44.9252471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9252476Z 2025-05-07T20:32:44.9252899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9252904Z 2025-05-07T20:32:44.9253007Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9253236Z self=, 2025-05-07T20:32:44.9253316Z T=4096, 2025-05-07T20:32:44.9253394Z D=7168, 2025-05-07T20:32:44.9253484Z scale_ub=1200.0, 2025-05-07T20:32:44.9253570Z contiguous=False, 2025-05-07T20:32:44.9253655Z compiled=True, 2025-05-07T20:32:44.9253733Z ) 2025-05-07T20:32:44.9253956Z self = 2025-05-07T20:32:44.9254135Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9254146Z 2025-05-07T20:32:44.9254222Z @given( 2025-05-07T20:32:44.9254341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9254450Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9254563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9254679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9254799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9254872Z ) 2025-05-07T20:32:44.9255116Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9255269Z def test_silu_mul_quant( 2025-05-07T20:32:44.9255345Z self, 2025-05-07T20:32:44.9255427Z T: int, 2025-05-07T20:32:44.9255505Z D: int, 2025-05-07T20:32:44.9255603Z scale_ub: Optional[float], 2025-05-07T20:32:44.9255739Z contiguous: bool, 2025-05-07T20:32:44.9255825Z compiled: bool, 2025-05-07T20:32:44.9255901Z ) -> None: 2025-05-07T20:32:44.9256003Z torch.manual_seed(2025) 2025-05-07T20:32:44.9256075Z 2025-05-07T20:32:44.9256243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9256321Z 2025-05-07T20:32:44.9256413Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9256536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9256631Z x = x_sign * x_clamp 2025-05-07T20:32:44.9256710Z x0 = x[:, :D] 2025-05-07T20:32:44.9256790Z x1 = x[:, D:] 2025-05-07T20:32:44.9256868Z 2025-05-07T20:32:44.9256996Z if contiguous: 2025-05-07T20:32:44.9257098Z x0 = x0.contiguous() 2025-05-07T20:32:44.9257187Z x1 = x1.contiguous() 2025-05-07T20:32:44.9257259Z 2025-05-07T20:32:44.9257398Z if scale_ub is not None: 2025-05-07T20:32:44.9257518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9257671Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9257774Z ) 2025-05-07T20:32:44.9257850Z else: 2025-05-07T20:32:44.9257943Z scale_ub_tensor = None 2025-05-07T20:32:44.9258022Z 2025-05-07T20:32:44.9258150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9258240Z op = silu_mul_quant 2025-05-07T20:32:44.9258333Z if compiled: 2025-05-07T20:32:44.9258432Z op = torch.compile(op) 2025-05-07T20:32:44.9258547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9258623Z 2025-05-07T20:32:44.9258716Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9258724Z 2025-05-07T20:32:44.9258828Z moe/activation_test.py:117: 2025-05-07T20:32:44.9258959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9259062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9259169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9259545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9259637Z return fn(*args, **kwargs) 
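
Every CompilationError in this run is the same compile-time rejection: Triton refuses to build IR for the fp8e4nv (e4m3) conversion that `_fbgemm_silu_mul_quant` requests, because the job's GPU (g5.4xlarge, NVIDIA A10G, compute capability sm_86) only offers the 'fp8e4b15' and 'fp8e5' encodings; fp8e4nv casts need sm_89 (Ada) or newer. A minimal sketch that reproduces the same ValueError on pre-sm_89 hardware — the kernel below is illustrative, not the FBGEMM kernel:

    # Hypothetical repro, assuming a recent Triton; only the fp8e4nv cast matters.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 (e.g. A10G) this cast is rejected while lowering the AST to IR,
        # raising ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)

The rejection happens in make_ir, before any PTX is generated, which is why the eager path and the torch.compile path (the eval_frame.py frame in the traceback in progress here) die identically at Triton's compile step.
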
2025-05-07T20:32:44.9260135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9260237Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9260602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9260824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9261167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9261269Z kernel = self.compile( 2025-05-07T20:32:44.9261654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9261832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9261963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9261967Z 2025-05-07T20:32:44.9262172Z self = 2025-05-07T20:32:44.9262966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9263467Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965561af20>} 2025-05-07T20:32:44.9264353Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9264544Z context = 2025-05-07T20:32:44.9264548Z 2025-05-07T20:32:44.9264715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9264986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9265094Z module_map=module_map) 2025-05-07T20:32:44.9265263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9265363Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9265440Z E ^ 2025-05-07T20:32:44.9265844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9265851Z 2025-05-07T20:32:44.9266310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9266315Z 2025-05-07T20:32:44.9266425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9266647Z self=, 2025-05-07T20:32:44.9266725Z T=128, 2025-05-07T20:32:44.9266809Z D=7168, 2025-05-07T20:32:44.9266892Z scale_ub=1200.0, 2025-05-07T20:32:44.9266980Z contiguous=False, 2025-05-07T20:32:44.9267073Z compiled=True, 2025-05-07T20:32:44.9267145Z ) 2025-05-07T20:32:44.9267362Z self = 2025-05-07T20:32:44.9267541Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:44.9267545Z 2025-05-07T20:32:44.9267625Z @given( 2025-05-07T20:32:44.9267754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9267854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9267970Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9268097Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9268210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9268284Z ) 2025-05-07T20:32:44.9268537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9268631Z def test_silu_mul_quant( 2025-05-07T20:32:44.9268707Z self, 2025-05-07T20:32:44.9268793Z T: int, 2025-05-07T20:32:44.9268871Z D: int, 2025-05-07T20:32:44.9268967Z scale_ub: Optional[float], 2025-05-07T20:32:44.9269145Z contiguous: bool, 2025-05-07T20:32:44.9269230Z compiled: bool, 2025-05-07T20:32:44.9269315Z ) -> None: 2025-05-07T20:32:44.9269410Z torch.manual_seed(2025) 2025-05-07T20:32:44.9269486Z 2025-05-07T20:32:44.9269660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9269736Z 2025-05-07T20:32:44.9269829Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9269966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9270055Z x = x_sign * x_clamp 2025-05-07T20:32:44.9270134Z x0 = x[:, :D] 2025-05-07T20:32:44.9270219Z x1 = x[:, D:] 2025-05-07T20:32:44.9270291Z 2025-05-07T20:32:44.9270373Z if contiguous: 2025-05-07T20:32:44.9270469Z x0 = x0.contiguous() 2025-05-07T20:32:44.9270557Z x1 = x1.contiguous() 2025-05-07T20:32:44.9270633Z 2025-05-07T20:32:44.9270723Z if scale_ub is not None: 2025-05-07T20:32:44.9270828Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9270967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9271042Z ) 2025-05-07T20:32:44.9271120Z else: 2025-05-07T20:32:44.9271271Z scale_ub_tensor = None 2025-05-07T20:32:44.9271344Z 2025-05-07T20:32:44.9271471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9271569Z op = silu_mul_quant 2025-05-07T20:32:44.9271698Z if compiled: 2025-05-07T20:32:44.9271800Z op = torch.compile(op) 2025-05-07T20:32:44.9271910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9271981Z 2025-05-07T20:32:44.9272077Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9272082Z 2025-05-07T20:32:44.9272179Z moe/activation_test.py:117: 2025-05-07T20:32:44.9272308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9272413Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9272514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9272883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9273024Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9273517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9273657Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9274016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9274237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9274581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9274673Z kernel = self.compile( 2025-05-07T20:32:44.9275053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9275231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9275360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9275367Z 2025-05-07T20:32:44.9275576Z self = 2025-05-07T20:32:44.9276362Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9276867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14220>} 2025-05-07T20:32:44.9277675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9277863Z context = 2025-05-07T20:32:44.9277871Z 2025-05-07T20:32:44.9278048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9278312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9278426Z module_map=module_map) 2025-05-07T20:32:44.9278586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9278684Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9278770Z E ^ 2025-05-07T20:32:44.9279129Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9279134Z 2025-05-07T20:32:44.9279550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9279555Z 2025-05-07T20:32:44.9279663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9279887Z self=, 2025-05-07T20:32:44.9280017Z T=2048, 2025-05-07T20:32:44.9280092Z D=7168, 2025-05-07T20:32:44.9280173Z scale_ub=None, 2025-05-07T20:32:44.9280265Z contiguous=True, 2025-05-07T20:32:44.9280349Z compiled=True, 2025-05-07T20:32:44.9280462Z ) 2025-05-07T20:32:44.9280692Z self = 2025-05-07T20:32:44.9280862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9280866Z 2025-05-07T20:32:44.9280943Z @given( 2025-05-07T20:32:44.9281067Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9281167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9281289Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9281407Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9281519Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9281599Z ) 2025-05-07T20:32:44.9281889Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9281985Z def test_silu_mul_quant( 2025-05-07T20:32:44.9282067Z self, 2025-05-07T20:32:44.9282183Z T: int, 2025-05-07T20:32:44.9282263Z D: int, 2025-05-07T20:32:44.9282368Z scale_ub: Optional[float], 2025-05-07T20:32:44.9282461Z contiguous: bool, 2025-05-07T20:32:44.9282550Z compiled: bool, 2025-05-07T20:32:44.9282635Z ) -> None: 2025-05-07T20:32:44.9282731Z torch.manual_seed(2025) 2025-05-07T20:32:44.9282808Z 2025-05-07T20:32:44.9282977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9283048Z 2025-05-07T20:32:44.9283147Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9283276Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9283365Z x = x_sign * x_clamp 2025-05-07T20:32:44.9283455Z x0 = x[:, :D] 2025-05-07T20:32:44.9283535Z x1 = x[:, D:] 2025-05-07T20:32:44.9283612Z 2025-05-07T20:32:44.9283705Z if contiguous: 2025-05-07T20:32:44.9283797Z x0 = x0.contiguous() 2025-05-07T20:32:44.9283886Z x1 = x1.contiguous() 2025-05-07T20:32:44.9283965Z 2025-05-07T20:32:44.9284059Z if scale_ub is not None: 2025-05-07T20:32:44.9284170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9284305Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9284381Z ) 2025-05-07T20:32:44.9284462Z else: 2025-05-07T20:32:44.9284557Z scale_ub_tensor = None 2025-05-07T20:32:44.9284629Z 2025-05-07T20:32:44.9284763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9284852Z op = silu_mul_quant 2025-05-07T20:32:44.9284936Z if compiled: 2025-05-07T20:32:44.9285042Z op = torch.compile(op) 2025-05-07T20:32:44.9285147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9285221Z 2025-05-07T20:32:44.9285321Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9285326Z 2025-05-07T20:32:44.9285424Z moe/activation_test.py:117: 2025-05-07T20:32:44.9285562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9285665Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9285764Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9286137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9286229Z return fn(*args, **kwargs) 
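
Since this is an architecture gap rather than a kernel bug, the usual mitigation is to gate the FP8 tests on compute capability. A sketch assuming plain unittest; the class and helper names are illustrative, not the suite's actual names:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) path needs compute capability >= (8, 9)
        # (Ada/Hopper); the A10G in this job reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not torch.cuda.is_available() or not _supports_fp8e4nv(),
        "fp8e4nv unsupported: this GPU only offers fp8e4b15/fp8e5",
    )
    class SiluMulQuantTests(unittest.TestCase):
        ...

Hypothesis honors unittest's skip machinery, so a class-level guard like this short-circuits before any example is drawn.
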
2025-05-07T20:32:44.9286723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9286824Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9287180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9287410Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9287849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9287946Z kernel = self.compile( 2025-05-07T20:32:44.9288377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9288551Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9288680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9288691Z 2025-05-07T20:32:44.9288895Z self = 2025-05-07T20:32:44.9289676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9290229Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9655d14d60>} 2025-05-07T20:32:44.9291046Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9291243Z context = 2025-05-07T20:32:44.9291247Z 2025-05-07T20:32:44.9291410Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9291672Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9291782Z module_map=module_map) 2025-05-07T20:32:44.9291941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9292044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9292126Z E ^ 2025-05-07T20:32:44.9292483Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9292487Z 2025-05-07T20:32:44.9292914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9292918Z 2025-05-07T20:32:44.9293020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9293242Z self=, 2025-05-07T20:32:44.9293324Z T=16384, 2025-05-07T20:32:44.9293400Z D=5120, 2025-05-07T20:32:44.9293487Z scale_ub=None, 2025-05-07T20:32:44.9293573Z contiguous=False, 2025-05-07T20:32:44.9293657Z compiled=False, 2025-05-07T20:32:44.9293735Z ) 2025-05-07T20:32:44.9293954Z self = 2025-05-07T20:32:44.9294130Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9294138Z 2025-05-07T20:32:44.9294222Z @given( 2025-05-07T20:32:44.9294338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9294439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9294563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9294680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9294799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9294871Z ) 2025-05-07T20:32:44.9295118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9295218Z def test_silu_mul_quant( 2025-05-07T20:32:44.9295294Z self, 2025-05-07T20:32:44.9295370Z T: int, 2025-05-07T20:32:44.9295452Z D: int, 2025-05-07T20:32:44.9295550Z scale_ub: Optional[float], 2025-05-07T20:32:44.9295638Z contiguous: bool, 2025-05-07T20:32:44.9295729Z compiled: bool, 2025-05-07T20:32:44.9295808Z ) -> None: 2025-05-07T20:32:44.9296003Z torch.manual_seed(2025) 2025-05-07T20:32:44.9296082Z 2025-05-07T20:32:44.9296250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9296333Z 2025-05-07T20:32:44.9296464Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9296588Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9298411Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9298458Z 2025-05-07T20:32:44.9298580Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9298584Z 2025-05-07T20:32:44.9298692Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9298957Z self=, 2025-05-07T20:32:44.9299035Z T=4096, 2025-05-07T20:32:44.9299119Z D=7168, 2025-05-07T20:32:44.9299200Z scale_ub=1200.0, 2025-05-07T20:32:44.9299286Z contiguous=True, 2025-05-07T20:32:44.9299374Z compiled=True, 2025-05-07T20:32:44.9299447Z ) 2025-05-07T20:32:44.9299672Z self = 2025-05-07T20:32:44.9299846Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9299851Z 2025-05-07T20:32:44.9299926Z @given( 2025-05-07T20:32:44.9300051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9300148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9300261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9300389Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9300501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9300574Z ) 2025-05-07T20:32:44.9300832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9300927Z def test_silu_mul_quant( 2025-05-07T20:32:44.9301007Z self, 2025-05-07T20:32:44.9301082Z T: int, 2025-05-07T20:32:44.9301157Z D: int, 2025-05-07T20:32:44.9301258Z scale_ub: Optional[float], 2025-05-07T20:32:44.9301346Z contiguous: bool, 2025-05-07T20:32:44.9301429Z compiled: bool, 2025-05-07T20:32:44.9301513Z ) -> None: 2025-05-07T20:32:44.9301606Z torch.manual_seed(2025) 2025-05-07T20:32:44.9301678Z 2025-05-07T20:32:44.9301853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9301924Z 2025-05-07T20:32:44.9302017Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9302147Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9303940Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
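
The OOM sizes line up exactly with the test's [T, 2*D] bfloat16 intermediates: each sign/abs/clamp/mul step materializes another tensor of that shape. For the T=4096, D=7168 example above:

    # 112 MiB = one [T, 2*D] bf16 tensor at T=4096, D=7168:
    >>> 4096 * (2 * 7168) * 2 / 2**20   # rows * cols * bytes-per-bf16
    112.0

The 448 MiB and 320 MiB failures match the same formula at T=16384 with D=7168 and D=5120 respectively.
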
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9303952Z 2025-05-07T20:32:44.9304068Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9304072Z 2025-05-07T20:32:44.9304174Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9304399Z self=, 2025-05-07T20:32:44.9304477Z T=16384, 2025-05-07T20:32:44.9304603Z D=7168, 2025-05-07T20:32:44.9304689Z scale_ub=None, 2025-05-07T20:32:44.9304774Z contiguous=False, 2025-05-07T20:32:44.9304858Z compiled=False, 2025-05-07T20:32:44.9304937Z ) 2025-05-07T20:32:44.9305202Z self = 2025-05-07T20:32:44.9305382Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9305392Z 2025-05-07T20:32:44.9305468Z @given( 2025-05-07T20:32:44.9305586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9305691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9305804Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9305919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9306037Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9306110Z ) 2025-05-07T20:32:44.9306354Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9306501Z def test_silu_mul_quant( 2025-05-07T20:32:44.9306576Z self, 2025-05-07T20:32:44.9306656Z T: int, 2025-05-07T20:32:44.9306770Z D: int, 2025-05-07T20:32:44.9306876Z scale_ub: Optional[float], 2025-05-07T20:32:44.9306970Z contiguous: bool, 2025-05-07T20:32:44.9307074Z compiled: bool, 2025-05-07T20:32:44.9307156Z ) -> None: 2025-05-07T20:32:44.9307278Z torch.manual_seed(2025) 2025-05-07T20:32:44.9307351Z 2025-05-07T20:32:44.9307519Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9309412Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9309425Z 2025-05-07T20:32:44.9309544Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9309549Z 2025-05-07T20:32:44.9309659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9309879Z self=, 2025-05-07T20:32:44.9309961Z T=2048, 2025-05-07T20:32:44.9310036Z D=7168, 2025-05-07T20:32:44.9310117Z scale_ub=1200.0, 2025-05-07T20:32:44.9310206Z contiguous=True, 2025-05-07T20:32:44.9310290Z compiled=True, 2025-05-07T20:32:44.9310362Z ) 2025-05-07T20:32:44.9310585Z self = 2025-05-07T20:32:44.9310755Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9310763Z 2025-05-07T20:32:44.9314914Z @given( 2025-05-07T20:32:44.9315057Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9315171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9315296Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9315413Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9315536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9315612Z ) 2025-05-07T20:32:44.9315864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9315970Z def test_silu_mul_quant( 2025-05-07T20:32:44.9316047Z self, 2025-05-07T20:32:44.9316126Z T: int, 2025-05-07T20:32:44.9316216Z D: int, 2025-05-07T20:32:44.9316315Z scale_ub: Optional[float], 2025-05-07T20:32:44.9316415Z contiguous: bool, 2025-05-07T20:32:44.9316501Z compiled: bool, 2025-05-07T20:32:44.9316584Z ) -> None: 2025-05-07T20:32:44.9316692Z torch.manual_seed(2025) 2025-05-07T20:32:44.9316853Z 2025-05-07T20:32:44.9317026Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9317109Z 2025-05-07T20:32:44.9317253Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9317389Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9319235Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9319241Z 2025-05-07T20:32:44.9319406Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9319412Z 2025-05-07T20:32:44.9319524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9319787Z self=, 2025-05-07T20:32:44.9319880Z T=2048, 2025-05-07T20:32:44.9319957Z D=7168, 2025-05-07T20:32:44.9320041Z scale_ub=None, 2025-05-07T20:32:44.9320136Z contiguous=True, 2025-05-07T20:32:44.9320222Z compiled=False, 2025-05-07T20:32:44.9320297Z ) 2025-05-07T20:32:44.9320525Z self = 2025-05-07T20:32:44.9320698Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9320702Z 2025-05-07T20:32:44.9320780Z @given( 2025-05-07T20:32:44.9320906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9321005Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9321128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9321251Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9321364Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9321447Z ) 2025-05-07T20:32:44.9321697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9321792Z def test_silu_mul_quant( 2025-05-07T20:32:44.9321878Z self, 2025-05-07T20:32:44.9321957Z T: int, 2025-05-07T20:32:44.9322033Z D: int, 2025-05-07T20:32:44.9322140Z scale_ub: Optional[float], 2025-05-07T20:32:44.9322230Z contiguous: bool, 2025-05-07T20:32:44.9322317Z compiled: bool, 2025-05-07T20:32:44.9322402Z ) -> None: 2025-05-07T20:32:44.9322498Z torch.manual_seed(2025) 2025-05-07T20:32:44.9322580Z 2025-05-07T20:32:44.9322748Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9322823Z 2025-05-07T20:32:44.9322924Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9324724Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
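
Free memory also ratchets down across examples (140.44 MiB, then 28.44 MiB, then 26.44 MiB), so tensors from earlier failed examples are evidently still alive when the next one starts. Because Hypothesis drives every example inside a single test-method call, per-example cleanup has to happen in the test body rather than in tearDown. A sketch of that shape (the structure is illustrative, not the suite's actual fix):

    import gc
    import torch

    def run_one_example(T: int, D: int) -> None:
        x = None
        try:
            x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
            ...  # the rest of the example body from the test above
        finally:
            x = None                   # drop the large bf16 intermediate
            gc.collect()               # reap anything the failed example left behind
            torch.cuda.empty_cache()   # return cached blocks to the driver
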
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9324731Z 2025-05-07T20:32:44.9324856Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9324860Z 2025-05-07T20:32:44.9324964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9325187Z self=, 2025-05-07T20:32:44.9325275Z T=1, 2025-05-07T20:32:44.9325352Z D=7168, 2025-05-07T20:32:44.9325437Z scale_ub=1200.0, 2025-05-07T20:32:44.9325580Z contiguous=True, 2025-05-07T20:32:44.9325665Z compiled=False, 2025-05-07T20:32:44.9325747Z ) 2025-05-07T20:32:44.9325975Z self = 2025-05-07T20:32:44.9326183Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9326189Z 2025-05-07T20:32:44.9326276Z @given( 2025-05-07T20:32:44.9326394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9326492Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9326615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9326731Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9326844Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9326924Z ) 2025-05-07T20:32:44.9327169Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9327269Z def test_silu_mul_quant( 2025-05-07T20:32:44.9327427Z self, 2025-05-07T20:32:44.9327506Z T: int, 2025-05-07T20:32:44.9327591Z D: int, 2025-05-07T20:32:44.9327689Z scale_ub: Optional[float], 2025-05-07T20:32:44.9327820Z contiguous: bool, 2025-05-07T20:32:44.9327919Z compiled: bool, 2025-05-07T20:32:44.9327998Z ) -> None: 2025-05-07T20:32:44.9328094Z torch.manual_seed(2025) 2025-05-07T20:32:44.9328624Z 2025-05-07T20:32:44.9328858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9328936Z 2025-05-07T20:32:44.9329040Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9329166Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9329267Z x = x_sign * x_clamp 2025-05-07T20:32:44.9329353Z x0 = x[:, :D] 2025-05-07T20:32:44.9329436Z x1 = x[:, D:] 2025-05-07T20:32:44.9329518Z 2025-05-07T20:32:44.9329606Z if contiguous: 2025-05-07T20:32:44.9329702Z x0 = x0.contiguous() 2025-05-07T20:32:44.9329804Z x1 = x1.contiguous() 2025-05-07T20:32:44.9329883Z 2025-05-07T20:32:44.9329978Z if scale_ub is not None: 2025-05-07T20:32:44.9330098Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9330243Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9330322Z ) 2025-05-07T20:32:44.9330409Z else: 2025-05-07T20:32:44.9330509Z scale_ub_tensor = None 2025-05-07T20:32:44.9330583Z 2025-05-07T20:32:44.9330725Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9330819Z op = silu_mul_quant 2025-05-07T20:32:44.9330916Z if compiled: 2025-05-07T20:32:44.9331017Z op = torch.compile(op) 2025-05-07T20:32:44.9331124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9331203Z 2025-05-07T20:32:44.9331295Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9331300Z 2025-05-07T20:32:44.9331400Z moe/activation_test.py:117: 2025-05-07T20:32:44.9331548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9331651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9331756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9332270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9332368Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9332736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9332961Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9333303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9333407Z kernel = self.compile( 2025-05-07T20:32:44.9333791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9334142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9334274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9334441Z 2025-05-07T20:32:44.9334652Z self = 2025-05-07T20:32:44.9335447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9335951Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556c540>} 2025-05-07T20:32:44.9336750Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9337020Z context = 2025-05-07T20:32:44.9337025Z 2025-05-07T20:32:44.9337254Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9337525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9337634Z module_map=module_map) 2025-05-07T20:32:44.9337809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9337910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9337993Z E ^ 2025-05-07T20:32:44.9338355Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9338359Z 2025-05-07T20:32:44.9338776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9338785Z 2025-05-07T20:32:44.9338899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9339123Z self=, 2025-05-07T20:32:44.9339206Z T=128, 2025-05-07T20:32:44.9339296Z D=5120, 2025-05-07T20:32:44.9339384Z scale_ub=None, 2025-05-07T20:32:44.9339472Z contiguous=True, 2025-05-07T20:32:44.9339568Z compiled=False, 2025-05-07T20:32:44.9339644Z ) 2025-05-07T20:32:44.9339863Z self = 2025-05-07T20:32:44.9340042Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9340046Z 2025-05-07T20:32:44.9340124Z @given( 2025-05-07T20:32:44.9340251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9340351Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9340468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9340601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9340718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9340794Z ) 2025-05-07T20:32:44.9341048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9341147Z def test_silu_mul_quant( 2025-05-07T20:32:44.9341226Z self, 2025-05-07T20:32:44.9341313Z T: int, 2025-05-07T20:32:44.9341392Z D: int, 2025-05-07T20:32:44.9341492Z scale_ub: Optional[float], 2025-05-07T20:32:44.9341592Z contiguous: bool, 2025-05-07T20:32:44.9341679Z compiled: bool, 2025-05-07T20:32:44.9341766Z ) -> None: 2025-05-07T20:32:44.9341863Z torch.manual_seed(2025) 2025-05-07T20:32:44.9341937Z 2025-05-07T20:32:44.9342110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9342188Z 2025-05-07T20:32:44.9342281Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9342413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9342553Z x = x_sign * x_clamp 2025-05-07T20:32:44.9342637Z x0 = x[:, :D] 2025-05-07T20:32:44.9342726Z x1 = x[:, D:] 2025-05-07T20:32:44.9342803Z 2025-05-07T20:32:44.9342888Z if contiguous: 2025-05-07T20:32:44.9343027Z x0 = x0.contiguous() 2025-05-07T20:32:44.9343118Z x1 = x1.contiguous() 2025-05-07T20:32:44.9343201Z 2025-05-07T20:32:44.9343291Z if scale_ub is not None: 2025-05-07T20:32:44.9343398Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9343539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9343615Z ) 2025-05-07T20:32:44.9343693Z else: 2025-05-07T20:32:44.9343793Z scale_ub_tensor = None 2025-05-07T20:32:44.9343867Z 2025-05-07T20:32:44.9343996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9344094Z op = silu_mul_quant 2025-05-07T20:32:44.9344224Z if compiled: 2025-05-07T20:32:44.9344329Z op = torch.compile(op) 2025-05-07T20:32:44.9344441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9344516Z 2025-05-07T20:32:44.9344652Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9344660Z 2025-05-07T20:32:44.9344760Z moe/activation_test.py:117: 2025-05-07T20:32:44.9344893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9345004Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9345106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9345606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9345713Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9346072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9346307Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9346653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9346755Z kernel = self.compile( 2025-05-07T20:32:44.9347199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9347374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9347502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9347517Z 2025-05-07T20:32:44.9347722Z self = 2025-05-07T20:32:44.9348508Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9349021Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556d620>} 2025-05-07T20:32:44.9349850Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9350049Z context = 2025-05-07T20:32:44.9350054Z 2025-05-07T20:32:44.9350219Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9350483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9350602Z module_map=module_map) 2025-05-07T20:32:44.9350764Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9350864Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9350954Z E ^ 2025-05-07T20:32:44.9351356Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9351361Z 2025-05-07T20:32:44.9351825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9351829Z 2025-05-07T20:32:44.9351936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9352159Z self=, 2025-05-07T20:32:44.9352248Z T=128, 2025-05-07T20:32:44.9352327Z D=7168, 2025-05-07T20:32:44.9352420Z scale_ub=None, 2025-05-07T20:32:44.9352508Z contiguous=True, 2025-05-07T20:32:44.9352595Z compiled=False, 2025-05-07T20:32:44.9352680Z ) 2025-05-07T20:32:44.9352898Z self = 2025-05-07T20:32:44.9353068Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9353111Z 2025-05-07T20:32:44.9353201Z @given( 2025-05-07T20:32:44.9353319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9353419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9353585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9353703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9353829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9353905Z ) 2025-05-07T20:32:44.9354151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9354255Z def test_silu_mul_quant( 2025-05-07T20:32:44.9354333Z self, 2025-05-07T20:32:44.9354412Z T: int, 2025-05-07T20:32:44.9354497Z D: int, 2025-05-07T20:32:44.9354596Z scale_ub: Optional[float], 2025-05-07T20:32:44.9354691Z contiguous: bool, 2025-05-07T20:32:44.9354777Z compiled: bool, 2025-05-07T20:32:44.9354855Z ) -> None: 2025-05-07T20:32:44.9354958Z torch.manual_seed(2025) 2025-05-07T20:32:44.9355035Z 2025-05-07T20:32:44.9355203Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9355283Z 2025-05-07T20:32:44.9355383Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9355507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9355603Z x = x_sign * x_clamp 2025-05-07T20:32:44.9355685Z x0 = x[:, :D] 2025-05-07T20:32:44.9355771Z x1 = x[:, D:] 2025-05-07T20:32:44.9355844Z 2025-05-07T20:32:44.9355932Z if contiguous: 2025-05-07T20:32:44.9356031Z x0 = x0.contiguous() 2025-05-07T20:32:44.9356119Z x1 = x1.contiguous() 2025-05-07T20:32:44.9356193Z 2025-05-07T20:32:44.9356291Z if scale_ub is not None: 2025-05-07T20:32:44.9356396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9356532Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9356618Z ) 2025-05-07T20:32:44.9356700Z else: 2025-05-07T20:32:44.9356796Z scale_ub_tensor = None 2025-05-07T20:32:44.9356874Z 2025-05-07T20:32:44.9357006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9357098Z op = silu_mul_quant 2025-05-07T20:32:44.9357191Z if compiled: 2025-05-07T20:32:44.9357293Z op = torch.compile(op) 2025-05-07T20:32:44.9357404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9357478Z 2025-05-07T20:32:44.9357569Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9357574Z 2025-05-07T20:32:44.9357678Z moe/activation_test.py:117: 2025-05-07T20:32:44.9357805Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9357907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9358016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9358514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9358669Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9359028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9359324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9359670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9359763Z kernel = self.compile( 2025-05-07T20:32:44.9360143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9360322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9360448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9360453Z 2025-05-07T20:32:44.9360662Z self = 2025-05-07T20:32:44.9361526Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9362030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556e480>} 2025-05-07T20:32:44.9362790Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9362979Z context = 2025-05-07T20:32:44.9362984Z 2025-05-07T20:32:44.9363154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9363421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9363540Z module_map=module_map) 2025-05-07T20:32:44.9363702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9363803Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9363889Z E ^ 2025-05-07T20:32:44.9364246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9364251Z 2025-05-07T20:32:44.9364665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9364670Z 2025-05-07T20:32:44.9364780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9365002Z self=, 2025-05-07T20:32:44.9365089Z T=2048, 2025-05-07T20:32:44.9365166Z D=7168, 2025-05-07T20:32:44.9365249Z scale_ub=1200.0, 2025-05-07T20:32:44.9365347Z contiguous=True, 2025-05-07T20:32:44.9365431Z compiled=False, 2025-05-07T20:32:44.9365505Z ) 2025-05-07T20:32:44.9365736Z self = 2025-05-07T20:32:44.9365915Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9365920Z 2025-05-07T20:32:44.9365997Z @given( 2025-05-07T20:32:44.9366121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9366219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9366341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9366459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9366592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9366679Z ) 2025-05-07T20:32:44.9366949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9367044Z def test_silu_mul_quant( 2025-05-07T20:32:44.9367132Z self, 2025-05-07T20:32:44.9367256Z T: int, 2025-05-07T20:32:44.9367333Z D: int, 2025-05-07T20:32:44.9367438Z scale_ub: Optional[float], 2025-05-07T20:32:44.9367530Z contiguous: bool, 2025-05-07T20:32:44.9367655Z compiled: bool, 2025-05-07T20:32:44.9367743Z ) -> None: 2025-05-07T20:32:44.9367839Z torch.manual_seed(2025) 2025-05-07T20:32:44.9367924Z 2025-05-07T20:32:44.9368094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9369890Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9369946Z 2025-05-07T20:32:44.9370065Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9370107Z 2025-05-07T20:32:44.9370215Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9370444Z self=, 2025-05-07T20:32:44.9370522Z T=1, 2025-05-07T20:32:44.9370599Z D=5120, 2025-05-07T20:32:44.9370690Z scale_ub=1200.0, 2025-05-07T20:32:44.9370779Z contiguous=True, 2025-05-07T20:32:44.9370866Z compiled=False, 2025-05-07T20:32:44.9370947Z ) 2025-05-07T20:32:44.9371165Z self = 2025-05-07T20:32:44.9371340Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9371344Z 2025-05-07T20:32:44.9371421Z @given( 2025-05-07T20:32:44.9371537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9371648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9371763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9371883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9372005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9372081Z ) 2025-05-07T20:32:44.9372328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9372429Z def test_silu_mul_quant( 2025-05-07T20:32:44.9372506Z self, 2025-05-07T20:32:44.9372592Z T: int, 2025-05-07T20:32:44.9372669Z D: int, 2025-05-07T20:32:44.9372767Z scale_ub: Optional[float], 2025-05-07T20:32:44.9372863Z contiguous: bool, 2025-05-07T20:32:44.9372950Z compiled: bool, 2025-05-07T20:32:44.9373029Z ) -> None: 2025-05-07T20:32:44.9373130Z torch.manual_seed(2025) 2025-05-07T20:32:44.9373203Z 2025-05-07T20:32:44.9373373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9373456Z 2025-05-07T20:32:44.9373549Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9373675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9373774Z x = x_sign * x_clamp 2025-05-07T20:32:44.9373855Z x0 = x[:, :D] 2025-05-07T20:32:44.9373944Z x1 = x[:, D:] 2025-05-07T20:32:44.9374016Z 2025-05-07T20:32:44.9374101Z if contiguous: 2025-05-07T20:32:44.9374200Z x0 = x0.contiguous() 2025-05-07T20:32:44.9374293Z x1 = x1.contiguous() 2025-05-07T20:32:44.9374365Z 2025-05-07T20:32:44.9374464Z if scale_ub is not None: 2025-05-07T20:32:44.9374570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9374705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9374790Z ) 2025-05-07T20:32:44.9374866Z else: 2025-05-07T20:32:44.9374979Z scale_ub_tensor = None 2025-05-07T20:32:44.9375054Z 2025-05-07T20:32:44.9375234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9375332Z op = silu_mul_quant 2025-05-07T20:32:44.9375419Z if compiled: 2025-05-07T20:32:44.9375567Z op = torch.compile(op) 2025-05-07T20:32:44.9375684Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9375758Z 2025-05-07T20:32:44.9375857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9375861Z 2025-05-07T20:32:44.9375959Z moe/activation_test.py:117: 2025-05-07T20:32:44.9376087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9376200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9376299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9376798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9376901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9377352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9377616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9377961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9378055Z kernel = self.compile( 2025-05-07T20:32:44.9378445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9378618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9378753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9378757Z 2025-05-07T20:32:44.9378963Z self = 2025-05-07T20:32:44.9379742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9380259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f965556fa60>} 2025-05-07T20:32:44.9381014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9381211Z context = 2025-05-07T20:32:44.9381215Z 2025-05-07T20:32:44.9381379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9381641Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9381763Z module_map=module_map) 2025-05-07T20:32:44.9381924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9382029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9382109Z E ^ 2025-05-07T20:32:44.9382470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9382474Z 2025-05-07T20:32:44.9382897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9382902Z 2025-05-07T20:32:44.9383005Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9383235Z self=, 2025-05-07T20:32:44.9383314Z T=2048, 2025-05-07T20:32:44.9383389Z D=5120, 2025-05-07T20:32:44.9383476Z scale_ub=None, 2025-05-07T20:32:44.9383562Z contiguous=True, 2025-05-07T20:32:44.9383646Z compiled=False, 2025-05-07T20:32:44.9383727Z ) 2025-05-07T20:32:44.9383990Z self = 2025-05-07T20:32:44.9384161Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9384168Z 2025-05-07T20:32:44.9384294Z @given( 2025-05-07T20:32:44.9384413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9384512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9384633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9384750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9384870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9384945Z ) 2025-05-07T20:32:44.9385190Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9385290Z def test_silu_mul_quant( 2025-05-07T20:32:44.9385367Z self, 2025-05-07T20:32:44.9385445Z T: int, 2025-05-07T20:32:44.9385527Z D: int, 2025-05-07T20:32:44.9385665Z scale_ub: Optional[float], 2025-05-07T20:32:44.9385758Z contiguous: bool, 2025-05-07T20:32:44.9385850Z compiled: bool, 2025-05-07T20:32:44.9385928Z ) -> None: 2025-05-07T20:32:44.9386060Z torch.manual_seed(2025) 2025-05-07T20:32:44.9386145Z 2025-05-07T20:32:44.9386312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9386394Z 2025-05-07T20:32:44.9386489Z > x_sign = torch.sign(x) 2025-05-07T20:32:44.9388279Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9388297Z 2025-05-07T20:32:44.9388415Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:44.9388420Z 2025-05-07T20:32:44.9388527Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9388756Z self=, 2025-05-07T20:32:44.9388833Z T=16384, 2025-05-07T20:32:44.9388911Z D=5120, 2025-05-07T20:32:44.9389003Z scale_ub=None, 2025-05-07T20:32:44.9389212Z contiguous=True, 2025-05-07T20:32:44.9389297Z compiled=False, 2025-05-07T20:32:44.9389378Z ) 2025-05-07T20:32:44.9389597Z self = 2025-05-07T20:32:44.9389778Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9389782Z 2025-05-07T20:32:44.9389858Z @given( 2025-05-07T20:32:44.9389976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9390086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9390202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9390318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9390444Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9390518Z ) 2025-05-07T20:32:44.9390763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9390863Z def test_silu_mul_quant( 2025-05-07T20:32:44.9390943Z self, 2025-05-07T20:32:44.9391024Z T: int, 2025-05-07T20:32:44.9391101Z D: int, 2025-05-07T20:32:44.9391200Z scale_ub: Optional[float], 2025-05-07T20:32:44.9391296Z contiguous: bool, 2025-05-07T20:32:44.9391383Z compiled: bool, 2025-05-07T20:32:44.9391462Z ) -> None: 2025-05-07T20:32:44.9391563Z torch.manual_seed(2025) 2025-05-07T20:32:44.9391640Z 2025-05-07T20:32:44.9391807Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9393718Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9393725Z 2025-05-07T20:32:44.9393843Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9393848Z 2025-05-07T20:32:44.9393955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9394176Z self=, 2025-05-07T20:32:44.9394260Z T=4096, 2025-05-07T20:32:44.9394376Z D=5120, 2025-05-07T20:32:44.9394458Z scale_ub=None, 2025-05-07T20:32:44.9394555Z contiguous=True, 2025-05-07T20:32:44.9394640Z compiled=False, 2025-05-07T20:32:44.9394714Z ) 2025-05-07T20:32:44.9394983Z self = 2025-05-07T20:32:44.9395154Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9395158Z 2025-05-07T20:32:44.9395240Z @given( 2025-05-07T20:32:44.9395361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9395459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9395582Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9395698Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9395811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9395893Z ) 2025-05-07T20:32:44.9396137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9396235Z def test_silu_mul_quant( 2025-05-07T20:32:44.9396325Z self, 2025-05-07T20:32:44.9396403Z T: int, 2025-05-07T20:32:44.9396481Z D: int, 2025-05-07T20:32:44.9396586Z scale_ub: Optional[float], 2025-05-07T20:32:44.9396681Z contiguous: bool, 2025-05-07T20:32:44.9396767Z compiled: bool, 2025-05-07T20:32:44.9396855Z ) -> None: 2025-05-07T20:32:44.9396955Z torch.manual_seed(2025) 2025-05-07T20:32:44.9397040Z 2025-05-07T20:32:44.9397234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9399025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9399043Z 2025-05-07T20:32:44.9399163Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9399170Z 2025-05-07T20:32:44.9399272Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9399498Z self=, 2025-05-07T20:32:44.9399575Z T=2048, 2025-05-07T20:32:44.9399651Z D=5120, 2025-05-07T20:32:44.9399739Z scale_ub=None, 2025-05-07T20:32:44.9399829Z contiguous=False, 2025-05-07T20:32:44.9399914Z compiled=False, 2025-05-07T20:32:44.9399993Z ) 2025-05-07T20:32:44.9400208Z self = 2025-05-07T20:32:44.9400384Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9400389Z 2025-05-07T20:32:44.9400465Z @given( 2025-05-07T20:32:44.9400584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9400736Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9400850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9401007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9401125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9401200Z ) 2025-05-07T20:32:44.9401444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9401545Z def test_silu_mul_quant( 2025-05-07T20:32:44.9401621Z self, 2025-05-07T20:32:44.9401705Z T: int, 2025-05-07T20:32:44.9401781Z D: int, 2025-05-07T20:32:44.9401879Z scale_ub: Optional[float], 2025-05-07T20:32:44.9401974Z contiguous: bool, 2025-05-07T20:32:44.9402059Z compiled: bool, 2025-05-07T20:32:44.9402138Z ) -> None: 2025-05-07T20:32:44.9402239Z torch.manual_seed(2025) 2025-05-07T20:32:44.9402354Z 2025-05-07T20:32:44.9402523Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9404339Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9404346Z 2025-05-07T20:32:44.9404465Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9404470Z 2025-05-07T20:32:44.9404579Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9404801Z self=, 2025-05-07T20:32:44.9404887Z T=4096, 2025-05-07T20:32:44.9404968Z D=7168, 2025-05-07T20:32:44.9405051Z scale_ub=None, 2025-05-07T20:32:44.9405141Z contiguous=True, 2025-05-07T20:32:44.9405224Z compiled=True, 2025-05-07T20:32:44.9405300Z ) 2025-05-07T20:32:44.9405526Z self = 2025-05-07T20:32:44.9405692Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9405697Z 2025-05-07T20:32:44.9405775Z @given( 2025-05-07T20:32:44.9405896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9405996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9406115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9406235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9406351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9406430Z ) 2025-05-07T20:32:44.9406674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9406773Z def test_silu_mul_quant( 2025-05-07T20:32:44.9406856Z self, 2025-05-07T20:32:44.9406933Z T: int, 2025-05-07T20:32:44.9407013Z D: int, 2025-05-07T20:32:44.9407119Z scale_ub: Optional[float], 2025-05-07T20:32:44.9407208Z contiguous: bool, 2025-05-07T20:32:44.9407297Z compiled: bool, 2025-05-07T20:32:44.9407384Z ) -> None: 2025-05-07T20:32:44.9407498Z torch.manual_seed(2025) 2025-05-07T20:32:44.9407584Z 2025-05-07T20:32:44.9407774Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9409553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9409616Z 2025-05-07T20:32:44.9409772Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9409777Z 2025-05-07T20:32:44.9409881Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9410109Z self=, 2025-05-07T20:32:44.9410188Z T=2048, 2025-05-07T20:32:44.9410265Z D=5120, 2025-05-07T20:32:44.9410355Z scale_ub=1200.0, 2025-05-07T20:32:44.9410442Z contiguous=False, 2025-05-07T20:32:44.9410529Z compiled=False, 2025-05-07T20:32:44.9410611Z ) 2025-05-07T20:32:44.9410828Z self = 2025-05-07T20:32:44.9411008Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9411053Z 2025-05-07T20:32:44.9411133Z @given( 2025-05-07T20:32:44.9411250Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9411354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9411511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9411627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9411750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9411826Z ) 2025-05-07T20:32:44.9412076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9412171Z def test_silu_mul_quant( 2025-05-07T20:32:44.9412249Z self, 2025-05-07T20:32:44.9412337Z T: int, 2025-05-07T20:32:44.9412413Z D: int, 2025-05-07T20:32:44.9412513Z scale_ub: Optional[float], 2025-05-07T20:32:44.9412608Z contiguous: bool, 2025-05-07T20:32:44.9412695Z compiled: bool, 2025-05-07T20:32:44.9412773Z ) -> None: 2025-05-07T20:32:44.9412876Z torch.manual_seed(2025) 2025-05-07T20:32:44.9412953Z 2025-05-07T20:32:44.9413119Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9414899Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9414904Z 2025-05-07T20:32:44.9415021Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9415025Z 2025-05-07T20:32:44.9415133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9415355Z self=, 2025-05-07T20:32:44.9415438Z T=4096, 2025-05-07T20:32:44.9415514Z D=7168, 2025-05-07T20:32:44.9415597Z scale_ub=1200.0, 2025-05-07T20:32:44.9415690Z contiguous=True, 2025-05-07T20:32:44.9415778Z compiled=False, 2025-05-07T20:32:44.9415852Z ) 2025-05-07T20:32:44.9416077Z self = 2025-05-07T20:32:44.9416249Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9416254Z 2025-05-07T20:32:44.9416330Z @given( 2025-05-07T20:32:44.9416451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9416549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9416667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9416782Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9416895Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9416981Z ) 2025-05-07T20:32:44.9417273Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9417367Z def test_silu_mul_quant( 2025-05-07T20:32:44.9417451Z self, 2025-05-07T20:32:44.9417571Z T: int, 2025-05-07T20:32:44.9417649Z D: int, 2025-05-07T20:32:44.9417753Z scale_ub: Optional[float], 2025-05-07T20:32:44.9417844Z contiguous: bool, 2025-05-07T20:32:44.9417929Z compiled: bool, 2025-05-07T20:32:44.9418013Z ) -> None: 2025-05-07T20:32:44.9418108Z torch.manual_seed(2025) 2025-05-07T20:32:44.9418187Z 2025-05-07T20:32:44.9418352Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9420175Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9420251Z 2025-05-07T20:32:44.9420368Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9420374Z 2025-05-07T20:32:44.9420476Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9420703Z self=, 2025-05-07T20:32:44.9420780Z T=16384, 2025-05-07T20:32:44.9420858Z D=7168, 2025-05-07T20:32:44.9420948Z scale_ub=None, 2025-05-07T20:32:44.9421034Z contiguous=False, 2025-05-07T20:32:44.9421117Z compiled=True, 2025-05-07T20:32:44.9421196Z ) 2025-05-07T20:32:44.9421411Z self = 2025-05-07T20:32:44.9421593Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.9421600Z 2025-05-07T20:32:44.9421677Z @given( 2025-05-07T20:32:44.9421791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9421899Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9422012Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9422128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9422321Z ) 2025-05-07T20:32:44.9422575Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9422673Z def test_silu_mul_quant( 2025-05-07T20:32:44.9422749Z self, 2025-05-07T20:32:44.9422831Z T: int, 2025-05-07T20:32:44.9422908Z D: int, 2025-05-07T20:32:44.9423007Z scale_ub: Optional[float], 2025-05-07T20:32:44.9423102Z contiguous: bool, 2025-05-07T20:32:44.9423192Z compiled: bool, 2025-05-07T20:32:44.9423273Z ) -> None: 2025-05-07T20:32:44.9423374Z torch.manual_seed(2025) 2025-05-07T20:32:44.9423448Z 2025-05-07T20:32:44.9423618Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9425411Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9425417Z 2025-05-07T20:32:44.9425534Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9425539Z 2025-05-07T20:32:44.9425650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9425918Z self=, 2025-05-07T20:32:44.9426001Z T=4096, 2025-05-07T20:32:44.9426079Z D=7168, 2025-05-07T20:32:44.9426201Z scale_ub=None, 2025-05-07T20:32:44.9426299Z contiguous=True, 2025-05-07T20:32:44.9426382Z compiled=False, 2025-05-07T20:32:44.9426454Z ) 2025-05-07T20:32:44.9426682Z self = 2025-05-07T20:32:44.9426850Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9426855Z 2025-05-07T20:32:44.9426933Z @given( 2025-05-07T20:32:44.9427060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9427178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9427315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9427438Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9427590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9427673Z ) 2025-05-07T20:32:44.9427917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9428048Z def test_silu_mul_quant( 2025-05-07T20:32:44.9428499Z self, 2025-05-07T20:32:44.9428617Z T: int, 2025-05-07T20:32:44.9428720Z D: int, 2025-05-07T20:32:44.9428830Z scale_ub: Optional[float], 2025-05-07T20:32:44.9428921Z contiguous: bool, 2025-05-07T20:32:44.9429010Z compiled: bool, 2025-05-07T20:32:44.9429143Z ) -> None: 2025-05-07T20:32:44.9429241Z torch.manual_seed(2025) 2025-05-07T20:32:44.9429318Z 2025-05-07T20:32:44.9429485Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9431266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9431282Z 2025-05-07T20:32:44.9431397Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9431401Z 2025-05-07T20:32:44.9431507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9431736Z self=, 2025-05-07T20:32:44.9431811Z T=16384, 2025-05-07T20:32:44.9431887Z D=7168, 2025-05-07T20:32:44.9431972Z scale_ub=None, 2025-05-07T20:32:44.9432055Z contiguous=True, 2025-05-07T20:32:44.9432137Z compiled=False, 2025-05-07T20:32:44.9432215Z ) 2025-05-07T20:32:44.9432431Z self = 2025-05-07T20:32:44.9432615Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:44.9432619Z 2025-05-07T20:32:44.9432697Z @given( 2025-05-07T20:32:44.9432815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9432917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9433029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9433144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9433261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9433334Z ) 2025-05-07T20:32:44.9433585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9433678Z def test_silu_mul_quant( 2025-05-07T20:32:44.9433754Z self, 2025-05-07T20:32:44.9433835Z T: int, 2025-05-07T20:32:44.9433910Z D: int, 2025-05-07T20:32:44.9434009Z scale_ub: Optional[float], 2025-05-07T20:32:44.9434106Z contiguous: bool, 2025-05-07T20:32:44.9434367Z compiled: bool, 2025-05-07T20:32:44.9434443Z ) -> None: 2025-05-07T20:32:44.9434544Z torch.manual_seed(2025) 2025-05-07T20:32:44.9434617Z 2025-05-07T20:32:44.9434849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9436632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9436639Z 2025-05-07T20:32:44.9436754Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9436826Z 2025-05-07T20:32:44.9436936Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9437161Z self=, 2025-05-07T20:32:44.9437311Z T=16384, 2025-05-07T20:32:44.9437392Z D=7168, 2025-05-07T20:32:44.9437486Z scale_ub=1200.0, 2025-05-07T20:32:44.9437585Z contiguous=True, 2025-05-07T20:32:44.9437682Z compiled=False, 2025-05-07T20:32:44.9437763Z ) 2025-05-07T20:32:44.9437985Z self = 2025-05-07T20:32:44.9438159Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9438163Z 2025-05-07T20:32:44.9438239Z @given( 2025-05-07T20:32:44.9438358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9438455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9438576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9438695Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9438809Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9438891Z ) 2025-05-07T20:32:44.9439143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9439237Z def test_silu_mul_quant( 2025-05-07T20:32:44.9439318Z self, 2025-05-07T20:32:44.9439394Z T: int, 2025-05-07T20:32:44.9439469Z D: int, 2025-05-07T20:32:44.9439571Z scale_ub: Optional[float], 2025-05-07T20:32:44.9439659Z contiguous: bool, 2025-05-07T20:32:44.9439743Z compiled: bool, 2025-05-07T20:32:44.9443822Z ) -> None: 2025-05-07T20:32:44.9443950Z torch.manual_seed(2025) 2025-05-07T20:32:44.9444024Z 2025-05-07T20:32:44.9444199Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9446013Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9446025Z 2025-05-07T20:32:44.9446145Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9446150Z 2025-05-07T20:32:44.9446259Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9446481Z self=, 2025-05-07T20:32:44.9446567Z T=128, 2025-05-07T20:32:44.9446644Z D=5120, 2025-05-07T20:32:44.9446728Z scale_ub=1200.0, 2025-05-07T20:32:44.9446823Z contiguous=False, 2025-05-07T20:32:44.9446909Z compiled=False, 2025-05-07T20:32:44.9446988Z ) 2025-05-07T20:32:44.9447286Z self = 2025-05-07T20:32:44.9447461Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:44.9447465Z 2025-05-07T20:32:44.9447586Z @given( 2025-05-07T20:32:44.9447713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9447811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9447935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9448051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9448250Z ) 2025-05-07T20:32:44.9448496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9448592Z def test_silu_mul_quant( 2025-05-07T20:32:44.9448679Z self, 2025-05-07T20:32:44.9448755Z T: int, 2025-05-07T20:32:44.9448876Z D: int, 2025-05-07T20:32:44.9448992Z scale_ub: Optional[float], 2025-05-07T20:32:44.9449083Z contiguous: bool, 2025-05-07T20:32:44.9449172Z compiled: bool, 2025-05-07T20:32:44.9449263Z ) -> None: 2025-05-07T20:32:44.9449399Z torch.manual_seed(2025) 2025-05-07T20:32:44.9449484Z 2025-05-07T20:32:44.9449653Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9449728Z 2025-05-07T20:32:44.9449829Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9449957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9450048Z x = x_sign * x_clamp 2025-05-07T20:32:44.9450139Z x0 = x[:, :D] 2025-05-07T20:32:44.9450220Z x1 = x[:, D:] 2025-05-07T20:32:44.9450292Z 2025-05-07T20:32:44.9450389Z if contiguous: 2025-05-07T20:32:44.9450484Z x0 = x0.contiguous() 2025-05-07T20:32:44.9450579Z x1 = x1.contiguous() 2025-05-07T20:32:44.9450658Z 2025-05-07T20:32:44.9450752Z if scale_ub is not None: 2025-05-07T20:32:44.9450867Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9451004Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9451085Z ) 2025-05-07T20:32:44.9451171Z else: 2025-05-07T20:32:44.9451266Z scale_ub_tensor = None 2025-05-07T20:32:44.9451340Z 2025-05-07T20:32:44.9451480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9451573Z op = silu_mul_quant 2025-05-07T20:32:44.9451662Z if compiled: 2025-05-07T20:32:44.9451773Z op = torch.compile(op) 2025-05-07T20:32:44.9451880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9451954Z 2025-05-07T20:32:44.9452054Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9452058Z 2025-05-07T20:32:44.9452157Z moe/activation_test.py:117: 2025-05-07T20:32:44.9452299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9452405Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9452507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9453029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9453127Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9453489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9453720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9454063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9454167Z kernel = self.compile( 2025-05-07T20:32:44.9454552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9454727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9454918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9454922Z 2025-05-07T20:32:44.9455127Z self = 2025-05-07T20:32:44.9455965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9456471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b6660>} 2025-05-07T20:32:44.9457221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9457491Z context = 2025-05-07T20:32:44.9457500Z 2025-05-07T20:32:44.9457691Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9458023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9458136Z module_map=module_map) 2025-05-07T20:32:44.9458297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9458403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9458481Z E ^ 2025-05-07T20:32:44.9458849Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9458854Z 2025-05-07T20:32:44.9459271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9459275Z 2025-05-07T20:32:44.9459379Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9459619Z self=, 2025-05-07T20:32:44.9459696Z T=2048, 2025-05-07T20:32:44.9459776Z D=7168, 2025-05-07T20:32:44.9459869Z scale_ub=None, 2025-05-07T20:32:44.9459962Z contiguous=False, 2025-05-07T20:32:44.9460056Z compiled=False, 2025-05-07T20:32:44.9460129Z ) 2025-05-07T20:32:44.9460346Z self = 2025-05-07T20:32:44.9460527Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.9460532Z 2025-05-07T20:32:44.9460611Z @given( 2025-05-07T20:32:44.9460730Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9460839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9460955Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9461077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9461283Z ) 2025-05-07T20:32:44.9461535Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9461633Z def test_silu_mul_quant( 2025-05-07T20:32:44.9461718Z self, 2025-05-07T20:32:44.9461804Z T: int, 2025-05-07T20:32:44.9461883Z D: int, 2025-05-07T20:32:44.9461982Z scale_ub: Optional[float], 2025-05-07T20:32:44.9462080Z contiguous: bool, 2025-05-07T20:32:44.9462168Z compiled: bool, 2025-05-07T20:32:44.9462250Z ) -> None: 2025-05-07T20:32:44.9462352Z torch.manual_seed(2025) 2025-05-07T20:32:44.9462427Z 2025-05-07T20:32:44.9462598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9464433Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9464479Z 2025-05-07T20:32:44.9464599Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9464612Z 2025-05-07T20:32:44.9464716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9464940Z self=, 2025-05-07T20:32:44.9465027Z T=128, 2025-05-07T20:32:44.9465105Z D=7168, 2025-05-07T20:32:44.9465194Z scale_ub=1200.0, 2025-05-07T20:32:44.9465293Z contiguous=True, 2025-05-07T20:32:44.9465378Z compiled=True, 2025-05-07T20:32:44.9465453Z ) 2025-05-07T20:32:44.9465681Z self = 2025-05-07T20:32:44.9465896Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9465900Z 2025-05-07T20:32:44.9465980Z @given( 2025-05-07T20:32:44.9466145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9466250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9466374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9466491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9466604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9466685Z ) 2025-05-07T20:32:44.9466931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9467029Z def test_silu_mul_quant( 2025-05-07T20:32:44.9467119Z self, 2025-05-07T20:32:44.9467196Z T: int, 2025-05-07T20:32:44.9467273Z D: int, 2025-05-07T20:32:44.9467381Z scale_ub: Optional[float], 2025-05-07T20:32:44.9467470Z contiguous: bool, 2025-05-07T20:32:44.9467591Z compiled: bool, 2025-05-07T20:32:44.9467677Z ) -> None: 2025-05-07T20:32:44.9467795Z torch.manual_seed(2025) 2025-05-07T20:32:44.9467877Z 2025-05-07T20:32:44.9468049Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9468124Z 2025-05-07T20:32:44.9468224Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9468351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9468443Z x = x_sign * x_clamp 2025-05-07T20:32:44.9468534Z x0 = x[:, :D] 2025-05-07T20:32:44.9468617Z x1 = x[:, D:] 2025-05-07T20:32:44.9468690Z 2025-05-07T20:32:44.9468782Z if contiguous: 2025-05-07T20:32:44.9468876Z x0 = x0.contiguous() 2025-05-07T20:32:44.9468968Z x1 = x1.contiguous() 2025-05-07T20:32:44.9469121Z 2025-05-07T20:32:44.9469213Z if scale_ub is not None: 2025-05-07T20:32:44.9469332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.9469476Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.9469556Z ) 2025-05-07T20:32:44.9469640Z else: 2025-05-07T20:32:44.9469737Z scale_ub_tensor = None 2025-05-07T20:32:44.9469811Z 2025-05-07T20:32:44.9469950Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.9470040Z op = silu_mul_quant 2025-05-07T20:32:44.9470125Z if compiled: 2025-05-07T20:32:44.9470232Z op = torch.compile(op) 2025-05-07T20:32:44.9470337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9470412Z 2025-05-07T20:32:44.9470509Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.9470514Z 2025-05-07T20:32:44.9470610Z moe/activation_test.py:117: 2025-05-07T20:32:44.9470745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9470846Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.9470946Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.9471453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:44.9471547Z return fn(*args, **kwargs) 
2025-05-07T20:32:44.9472085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.9472195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.9472553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.9472783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.9473122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.9473216Z kernel = self.compile( 2025-05-07T20:32:44.9473605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.9473821Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.9473958Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.9474001Z 2025-05-07T20:32:44.9474208Z self = 2025-05-07T20:32:44.9474992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.9475505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f96553b7c40>} 2025-05-07T20:32:44.9476260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.9476464Z context = 2025-05-07T20:32:44.9476469Z 2025-05-07T20:32:44.9476639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.9476904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.9477019Z module_map=module_map) 2025-05-07T20:32:44.9477182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.9477288Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.9477366Z E ^ 2025-05-07T20:32:44.9477729Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.9477734Z 2025-05-07T20:32:44.9478162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.9478170Z 2025-05-07T20:32:44.9478281Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9478513Z self=, 2025-05-07T20:32:44.9478591Z T=128, 2025-05-07T20:32:44.9478671Z D=7168, 2025-05-07T20:32:44.9478766Z scale_ub=1200.0, 2025-05-07T20:32:44.9478853Z contiguous=True, 2025-05-07T20:32:44.9478939Z compiled=False, 2025-05-07T20:32:44.9479019Z ) 2025-05-07T20:32:44.9479237Z self = 2025-05-07T20:32:44.9479408Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.9479412Z 2025-05-07T20:32:44.9479499Z @given( 2025-05-07T20:32:44.9479618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9479718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9479842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9479960Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9480127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9480201Z ) 2025-05-07T20:32:44.9480453Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9480596Z def test_silu_mul_quant( 2025-05-07T20:32:44.9480674Z self, 2025-05-07T20:32:44.9480756Z T: int, 2025-05-07T20:32:44.9480845Z D: int, 2025-05-07T20:32:44.9480945Z scale_ub: Optional[float], 2025-05-07T20:32:44.9481034Z contiguous: bool, 2025-05-07T20:32:44.9481130Z compiled: bool, 2025-05-07T20:32:44.9481209Z ) -> None: 2025-05-07T20:32:44.9481315Z torch.manual_seed(2025) 2025-05-07T20:32:44.9481392Z 2025-05-07T20:32:44.9481560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9481643Z 2025-05-07T20:32:44.9481738Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9481863Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9483740Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9483747Z 2025-05-07T20:32:44.9483866Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9483871Z 2025-05-07T20:32:44.9483982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9484203Z self=, 2025-05-07T20:32:44.9484281Z T=128, 2025-05-07T20:32:44.9484364Z D=5120, 2025-05-07T20:32:44.9484448Z scale_ub=1200.0, 2025-05-07T20:32:44.9484544Z contiguous=True, 2025-05-07T20:32:44.9484629Z compiled=True, 2025-05-07T20:32:44.9484703Z ) 2025-05-07T20:32:44.9484929Z self = 2025-05-07T20:32:44.9485098Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:44.9485104Z 2025-05-07T20:32:44.9485187Z @given( 2025-05-07T20:32:44.9485303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9485401Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9485524Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9485639Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9485751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9485835Z ) 2025-05-07T20:32:44.9486079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9486174Z def test_silu_mul_quant( 2025-05-07T20:32:44.9486264Z self, 2025-05-07T20:32:44.9486341Z T: int, 2025-05-07T20:32:44.9486424Z D: int, 2025-05-07T20:32:44.9486521Z scale_ub: Optional[float], 2025-05-07T20:32:44.9486613Z contiguous: bool, 2025-05-07T20:32:44.9486707Z compiled: bool, 2025-05-07T20:32:44.9486786Z ) -> None: 2025-05-07T20:32:44.9486882Z torch.manual_seed(2025) 2025-05-07T20:32:44.9486966Z 2025-05-07T20:32:44.9487138Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9487214Z 2025-05-07T20:32:44.9487334Z x_sign = torch.sign(x) 2025-05-07T20:32:44.9487473Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.9489311Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9489379Z 2025-05-07T20:32:44.9489497Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:44.9489501Z 2025-05-07T20:32:44.9489610Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.9489831Z self=, 2025-05-07T20:32:44.9489911Z T=128, 2025-05-07T20:32:44.9489993Z D=7168, 2025-05-07T20:32:44.9490077Z scale_ub=None, 2025-05-07T20:32:44.9490164Z contiguous=True, 2025-05-07T20:32:44.9490253Z compiled=True, 2025-05-07T20:32:44.9490330Z ) 2025-05-07T20:32:44.9490548Z self = 2025-05-07T20:32:44.9490719Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:44.9490767Z 2025-05-07T20:32:44.9490845Z @given( 2025-05-07T20:32:44.9490960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.9491104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.9491221Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.9491344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.9491457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.9491530Z ) 2025-05-07T20:32:44.9491779Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.9491872Z def test_silu_mul_quant( 2025-05-07T20:32:44.9491948Z self, 2025-05-07T20:32:44.9492031Z T: int, 2025-05-07T20:32:44.9492107Z D: int, 2025-05-07T20:32:44.9492203Z scale_ub: Optional[float], 2025-05-07T20:32:44.9492302Z contiguous: bool, 2025-05-07T20:32:44.9492386Z compiled: bool, 2025-05-07T20:32:44.9492477Z ) -> None: 2025-05-07T20:32:44.9492571Z torch.manual_seed(2025) 2025-05-07T20:32:44.9492645Z 2025-05-07T20:32:44.9492818Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.9494595Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:44.9494601Z 2025-05-07T20:32:44.9494725Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:44.9494859Z =============================== warnings summary =============================== 2025-05-07T20:32:44.9495173Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9495483Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9495782Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:44.9496674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.11/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:44.9496903Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:44.9496908Z 2025-05-07T20:32:44.9497118Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:44.9497293Z ================= 1 failed, 1 deselected, 3 warnings in 13.88s ================= 2025-05-07T20:32:46.5191792Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:46.5810987Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:46.5811613Z 2025-05-07T20:32:46.5812080Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:46.5813520Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:46.5814324Z 2025-05-07T20:32:46.5814332Z 2025-05-07T20:32:46.5814340Z 2025-05-07T20:32:46.5831161Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:46.5911701Z Post job cleanup. 2025-05-07T20:32:46.6891685Z [command]/usr/bin/git version 2025-05-07T20:32:46.6933015Z git version 2.47.1 2025-05-07T20:32:46.6967773Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/b3674192-b0ff-41b0-bb10-935329a809c5/.gitconfig' 2025-05-07T20:32:46.6978458Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/b3674192-b0ff-41b0-bb10-935329a809c5' before making global git config changes 2025-05-07T20:32:46.6979309Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:46.6983889Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:46.7032664Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:46.7067704Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:46.7406013Z Entering 'external/asmjit' 2025-05-07T20:32:46.7473222Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.7547295Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.7618285Z Entering 'external/cutlass' 2025-05-07T20:32:46.7694068Z Entering 'external/googletest' 2025-05-07T20:32:46.7761159Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.7827284Z Entering 'external/json' 2025-05-07T20:32:46.7917690Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:46.7944352Z http.https://github.com/.extraheader 2025-05-07T20:32:46.7956321Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:46.7992039Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:46.8326872Z Entering 'external/asmjit' 2025-05-07T20:32:46.8369675Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8412732Z Entering 'external/composable_kernel' 2025-05-07T20:32:46.8456953Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8505770Z Entering 'external/cpuinfo' 2025-05-07T20:32:46.8549129Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8592384Z Entering 'external/cutlass' 2025-05-07T20:32:46.8635477Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8686279Z 
Entering 'external/googletest' 2025-05-07T20:32:46.8737277Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8774204Z Entering 'external/hipify_torch' 2025-05-07T20:32:46.8816608Z http.https://github.com/.extraheader 2025-05-07T20:32:46.8859551Z Entering 'external/json' 2025-05-07T20:32:46.8903069Z http.https://github.com/.extraheader 2025-05-07T20:32:46.9052067Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:32:46.9085424Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:32:46.9096190Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:32:46.9096548Z ##[endgroup] 2025-05-07T20:32:46.9196508Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:32:57.7037607Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:14.0619137Z Cleaning up orphan processes
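
The CompilationError failures above point at an architecture mismatch rather than a kernel bug: this job ran on linux.g5.4xlarge.nvidia.gpu, an NVIDIA A10G at compute capability sm_86, and Triton's fp8e4nv (e4m3) dtype appears to require a newer part (sm_89/Ada or later); on sm_86 Triton offers only fp8e4b15 and fp8e5, exactly as the ValueError states. Below is a minimal sketch of a capability guard that a test like this could use to skip FP8 cases on unsupported GPUs; the (8, 9) threshold and the skipIf wiring are illustrative assumptions, not code from moe/activation_test.py:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) is assumed to need sm_89 or newer; the A10G in
        # this log is sm_86, so the guard returns False there.
        if not torch.cuda.is_available():
            return False
        # get_device_capability() returns (major, minor), e.g. (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantFP8Test(unittest.TestCase):
        def test_placeholder(self) -> None:
            pass

With such a guard the job would report a skip instead of failing compilation on every fp8 example.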
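The long run of torch.OutOfMemoryError examples follows one pattern: after the first failure the GPU is left with only ~26 MiB free of 22.07 GiB, so every subsequent Hypothesis example dies at its very first allocation. The "Tried to allocate" sizes are exactly the input tensor x of shape [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes: 2048 * 10240 * 2 B = 40 MiB, 4096 * 14336 * 2 B = 112 MiB, and 16384 * 14336 * 2 B = 448 MiB, matching the log line for line. Note also that only ~19 MiB is "reserved by PyTorch but unallocated", so the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True suggestion in the message targets fragmentation and would likely not help here; the memory is held by live allocations carried over from earlier examples. A hedged sketch of per-example cleanup that could keep one OOM from cascading — the helper and where to call it are assumptions, not part of the original test:

    import gc

    import torch

    def free_cuda_between_examples() -> None:
        # Drop dangling Python references first, then return the caching
        # allocator's unused blocks to the driver and wait for pending work.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # e.g. called at the top of test_silu_mul_quant so that each Hypothesis
    # example starts from a clean allocator state.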
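Finally, @settings(verbosity=Verbosity.verbose) is what prints each "Trying example" block above, and the harness already re-runs with --lf --last-failed-no-failures none, so the logged parameter sets can be replayed deterministically. A small self-contained sketch of pinning one logged failure with hypothesis.example so it always runs first; the test body here is a stand-in, not the real test:

    from typing import Optional

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=2048, D=5120, scale_ub=None)  # first failing case in the log
    @settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
    def test_shapes(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Stand-in assertion; the real test builds x = [T, 2 * D] in bf16.
        assert T * 2 * D > 0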